Taming the Big Data Fire Hose

Presentation transcript:

Taming the Big Data Fire Hose
John Hugg, Sr. Software Engineer, VoltDB

Big Data Defined
- Velocity
  - Moves at very high rates (think sensor-driven systems)
  - Valuable in its temporal, high-velocity state
- Volume
  - Fast-moving data creates massive historical archives
  - Valuable for mining patterns, trends and relationships
- Variety
  - Structured (logs, business transactions)
  - Semi-structured and unstructured

Example Big Data Use Cases
Data source | High-frequency operations | Lower-frequency operations
Capital markets | Write/index all trades, store tick data | Show consolidated risk across traders
Call initiation request | Real-time authorization | Fraud detection/analysis
Inbound HTTP requests | Visitor logging, analysis, alerting | Traffic pattern analytics
Online game | Rank scores: defined intervals, player “bests” | Leaderboard lookups
Real-time ad trading systems | Match form factor, placement criteria, bid/ask | Report ad performance from exhaust stream
Mobile device location sensor | Location updates, QoS, transactions | Analytics on transactions
Example use cases:
- Financial trade monitoring
- Telco call data record management
- Website analytics, fraud detection
- Online gaming micro transactions
- Digital ad exchange services
- Wireless location-based services

Big Data and You
- Incoming data streams are different from traditional business apps
- You need to write data quickly and reliably, but …
- It’s not just about high-speed writes
  - You need to validate in real time
  - You need to count and aggregate
  - You need to analyze in real time
  - You need to scale on demand
  - You may need to transact

Big Data Management Infrastructure
- Data sources: online gaming, ad serving, sensor data, Internet commerce, SaaS, Web 2.0, mobile platforms, financial trade
- High Velocity
  - NewSQL: structured data, ACID guarantees, relational/SQL, real-time analytics
  - NoSQL: unstructured data, eventual consistency, schemaless, KV, document
- High Volume
  - Analytic datastore
  - Other OLAP data stores


High Velocity Data Management

High Velocity DBMS Requirements
- Ingest at very high speeds and rates
- Scale easily to meet growth and demand peaks
- Support integrated fault tolerance
- Support a wide range of real-time (or “near-time”) analytics
- Integrate easily with high-volume analytic datastores

High Speed Data Ingestion
- Support millions of write operations per second at scale
- Read and write latencies below 50 milliseconds
- Provide ACID-level consistency guarantees (maybe)
- Support one or more well-known application interfaces
  - SQL
  - Key/Value
  - Document

Scale to Meet Growth and Demand
- Scale out on commodity hardware
- Built-in database partitioning
  - Manual sharding and/or add-on solutions are brittle, require apps to do “heavy lifting”, and can be an operational nightmare
  - Database must automatically implement the defined partitioning strategy
  - Application should “see” a single database instance
- Database should encourage scalability best practices
  - For example, replication of reference data minimizes the need for multi-partition operations

A Look Inside Partitioning
Statements and the partitions they touch:
- select count(*) from orders where customer_id = 5 (single-partition)
- select count(*) from orders where product_id = 3 (multi-partition)
- insert into orders (customer_id, order_id, product_id) values (3,303,2) (single-partition)
- update products set product_name = 'spork' where product_id = 3 (multi-partition)
Schema:
- table orders (partitioned): customer_id (partition key), order_id, product_id
- table products (replicated): product_id, product_name
Data layout (orders rows are customer_id, order_id, product_id):
- Partition 1: (1, 101, 2), (1, 101, 3), (4, 401, 2)
- Partition 2: (2, 201, 1), (5, 501, 3), (5, 502, 2)
- Partition 3: (3, 201, 1), (6, 601, 1), (6, 601, 2)
- products (replicated): (1, knife), (2, spoon), (3, fork)
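
The single-partition versus multi-partition distinction on this slide can be sketched in a few lines of Python. This is only an illustration, not VoltDB's implementation: the function names are hypothetical, and the routing rule `(customer_id - 1) % 3` is an assumption chosen solely because it reproduces the row placement shown on the slide.

```python
from dataclasses import dataclass, field

NUM_PARTITIONS = 3

# Rows copied from the slide: (customer_id, order_id, product_id).
ORDERS = [
    (1, 101, 2), (1, 101, 3), (4, 401, 2),
    (2, 201, 1), (5, 501, 3), (5, 502, 2),
    (3, 201, 1), (6, 601, 1), (6, 601, 2),
]
PRODUCTS = {1: "knife", 2: "spoon", 3: "fork"}


def partition_for(customer_id: int) -> int:
    # Assumption: a routing rule picked only to match the slide's layout.
    return (customer_id - 1) % NUM_PARTITIONS


@dataclass
class Partition:
    orders: list = field(default_factory=list)    # partitioned table
    products: dict = field(default_factory=dict)  # replicated table


def build_cluster():
    partitions = [Partition() for _ in range(NUM_PARTITIONS)]
    for row in ORDERS:
        partitions[partition_for(row[0])].orders.append(row)
    for p in partitions:
        p.products = dict(PRODUCTS)  # every partition holds a full copy
    return partitions


def count_orders_for_customer(partitions, customer_id):
    """Filter on the partition key: only one partition is consulted."""
    p = partitions[partition_for(customer_id)]
    return sum(1 for c, _, _ in p.orders if c == customer_id), 1


def count_orders_for_product(partitions, product_id):
    """Filter on a non-key column: every partition must be consulted."""
    total = sum(1 for p in partitions for _, _, prod in p.orders if prod == product_id)
    return total, len(partitions)


def insert_order(partitions, row):
    """The partition key is in the row, so the write goes to one partition."""
    index = partition_for(row[0])
    partitions[index].orders.append(row)
    return index


def rename_product(partitions, product_id, name):
    """Writes to a replicated table must reach every copy."""
    for p in partitions:
        p.products[product_id] = name
    return len(partitions)
```

Each query function returns the result together with the number of partitions it touched, which makes the cost difference visible: a filter on `customer_id` touches one partition, while a filter on `product_id` or an update to the replicated `products` table touches all three.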

Integrated Fault Tolerance
- Database should transparently support built-in “Tandem-style” HA
- Users should be able to easily increase/decrease fault tolerance levels
- Database should be easily and quickly recoverable in the event of severe hardware failures
- Database should be able to automatically detect and manage a variety of partition fault conditions
- Downed nodes should be “rejoinable” without the need for service windows

Partition Detection & Recovery
(Diagrams: a three-node cluster of Server A, Server B and Server C)
- Network fault protection
  - Detects partition event
  - Determines which side of fault to disable
  - Snapshots and disables orphaned node(s)
- Live node rejoin
  - Allows “downed” nodes to rejoin live cluster
  - Automatically re-synchs all node data
  - Coordinates transactions during re-synch
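
The slide does not say how the database decides which side of a network fault to disable. One common approach is a majority rule, sketched below. Both the rule and the tie-breaker (keep the side that contains the smallest node name) are assumptions made for illustration and should not be read as VoltDB's actual algorithm; `should_stay_up` is a hypothetical name.

```python
def should_stay_up(all_nodes: set, visible_nodes: set) -> bool:
    """Decide whether the side that can still see `visible_nodes` keeps serving.

    Assumed policy (not taken from the slides): the strict majority survives;
    on an even split, the side holding the smallest node name survives.
    """
    visible = visible_nodes & all_nodes
    if 2 * len(visible) > len(all_nodes):
        return True   # strict majority
    if 2 * len(visible) < len(all_nodes):
        return False  # strict minority: snapshot and shut down
    return min(all_nodes) in visible  # deterministic tie-break


cluster = {"Server A", "Server B", "Server C"}
majority_side = {"Server A", "Server B"}
orphaned_side = {"Server C"}
```

With the three servers from the slide, the side containing Server A and Server B stays up and the orphaned Server C is disabled. The point of a deterministic rule is that both sides reach the same conclusion without being able to talk to each other, so exactly one side keeps accepting writes.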

Real-time Analytics
- Database should support a wide variety of high-performance reads
  - High-frequency single-partition
  - Lower-frequency multi-partition
- Common analytic queries should be optimized in the database
  - Multi-partition aggregations, limits, etc.
- Database should accommodate a flexible range of relational data operations
  - Particularly relevant to structured data
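
A multi-partition aggregation with a limit is typically evaluated as a scatter-gather: each partition computes a partial aggregate, and a coordinator merges the partials before applying the limit. The sketch below shows that pattern using the order rows from the partitioning slide. The function names are hypothetical and the code illustrates the general technique, not how any particular database optimizes it.

```python
import heapq
from collections import Counter

# Order rows per partition, (customer_id, order_id, product_id),
# as laid out on the partitioning slide.
PARTITIONS = [
    [(1, 101, 2), (1, 101, 3), (4, 401, 2)],
    [(2, 201, 1), (5, 501, 3), (5, 502, 2)],
    [(3, 201, 1), (6, 601, 1), (6, 601, 2)],
]


def partial_counts(rows):
    """Runs locally on one partition: orders per product."""
    return Counter(product_id for _, _, product_id in rows)


def top_products(partitions, limit):
    """Coordinator: merge the partial counts, then apply the limit."""
    merged = Counter()
    for rows in partitions:
        merged.update(partial_counts(rows))
    # Order by count descending, then product_id ascending for stable ties.
    return heapq.nsmallest(limit, merged.items(), key=lambda kv: (-kv[1], kv[0]))
```

The limit is applied only after merging. Truncating each partition's partial result first could drop a product that is mid-ranked everywhere but top-ranked overall.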

Integration with Analytic Datastores
- Database should offer high-performance, transactional export
- Export should allow a wide variety of common data enrichment operations
  - Normalize and de-normalize
  - De-duplicate
  - Aggregate
- Architecture should support loosely-coupled integrations
  - Impedance mismatches
  - Durability
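
Two of the enrichment operations listed above, de-duplication and aggregation, can be sketched as a single pass over an export stream. The record fields (`event_id`, `campaign`, `clicks`) and the function name are hypothetical, chosen only to make the example concrete.

```python
def dedupe_and_aggregate(events):
    """Drop repeated events by id, then sum clicks per campaign."""
    seen_ids = set()
    clicks_by_campaign = {}
    for event in events:
        if event["event_id"] in seen_ids:
            continue  # duplicate delivery: ignore it
        seen_ids.add(event["event_id"])
        campaign = event["campaign"]
        clicks_by_campaign[campaign] = clicks_by_campaign.get(campaign, 0) + event["clicks"]
    return clicks_by_campaign


sample_events = [
    {"event_id": 1, "campaign": "spring", "clicks": 3},
    {"event_id": 2, "campaign": "summer", "clicks": 1},
    {"event_id": 1, "campaign": "spring", "clicks": 3},  # redelivered
    {"event_id": 3, "campaign": "spring", "clicks": 2},
]
```

De-duplicating before aggregating matters when the export path may deliver a record more than once: the redelivered event above would otherwise inflate the "spring" total.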

VoltDB Export Data Flow
(Diagram: export from the high velocity database cluster)
- Loosely-coupled, asynchronous
- Queue must be durable
- Bi-directional durability
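
What "queue must be durable" means in practice can be sketched with an append-only file plus a persisted acknowledgement counter: records are synced to disk before the producer moves on, and a consumer that restarts resumes from the last acknowledged record. This is a toy model under those assumptions, with a hypothetical `DurableQueue` class, and is not how VoltDB's export is implemented.

```python
import json
import os
import tempfile


class DurableQueue:
    """Append-only JSON-lines log with a persisted acknowledgement count."""

    def __init__(self, path):
        self.path = path
        self.ack_path = path + ".ack"
        open(self.path, "a", encoding="utf-8").close()  # create if missing

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # on disk before we report success

    def _acked(self):
        try:
            with open(self.ack_path, encoding="utf-8") as f:
                return int(f.read().strip() or 0)
        except FileNotFoundError:
            return 0

    def pending(self):
        """Records the consumer has not yet acknowledged."""
        with open(self.path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        return [json.loads(line) for line in lines[self._acked():]]

    def ack(self, count):
        """Persist that `count` more records were safely consumed."""
        new_total = self._acked() + count
        tmp_path = self.ack_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            f.write(str(new_total))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.ack_path)  # atomic swap of the counter


def demo():
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "export.log")
        queue = DurableQueue(path)
        for seq in range(3):
            queue.append({"seq": seq})
        queue.ack(2)
        restarted = DurableQueue(path)  # simulate a consumer restart
        return restarted.pending()
```

Because the producer only appends and the consumer only advances its own counter, the two sides stay loosely coupled and can run asynchronously; a consumer crash between reading and acknowledging leads to redelivery rather than loss.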

Summary
- Big Data infrastructures will usually require more than one engine
  - High velocity engine for “fast” data
  - Analytic engine for “deep” data
- Data characteristics will often determine which high velocity engine to use
  - NewSQL is often well-suited to structured data
  - NoSQL is often a good fit for unstructured data
- Choose solutions that suit your needs and are designed for interoperability