

Physical Data Storage Stephen Dawson-Haggerty

Data Sources: sMAP feeds into Hadoop HDFS and StreamFS, which back the applications – data exploration/visualization, control loops, demand response, analytics, mobile feedback, and fault detection.

Time-Series Databases: expected workload, related work, server architecture, API, performance, and future directions.

Write Workload sMAP sources (e.g., a Dent circuit meter) – an HTTP/REST protocol for exposing physical information; data trickles in as it's generated; typical data rates are 1 reading per 1–60 s. Bulk imports – existing databases and migrations.

Read Workload Plotting engine; Matlab and Python adaptors for analysis; mobile apps; batch analysis. Dominated by range queries. Latency is important for interactive data exploration.

[Architecture diagram: the readingdb server exposes a time-series interface (insert, resample, aggregate, query, streaming pipeline) built on bucketing, RPC, and compression layers; underneath sits a key-value store with a page cache, lock manager, and storage allocator; a storage mapper is backed by SQL/MySQL.]

Time series interface db_open() db_query(streamid, start, end) – query points in a range db_next(streamid, ref), db_prev(...) – query points near a reference time db_add(streamid, vector) – insert points into the database db_avail(streamid) – retrieve the storage map db_close() All data is part of a stream, identified only by streamid. A stream is a series of tuples: (timestamp, sequence, value, min, max).
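
The calls above can be sketched as a toy in-memory store. The names mirror the slide, but the implementation – a sorted Python list per stream with binary search for range and neighbor queries – is illustrative only; the real server keeps buckets in Berkeley DB.

```python
import bisect

class StreamDB:
    """Toy in-memory sketch of the readingdb time-series interface."""

    def __init__(self):
        # streamid -> list of (timestamp, seq, value, min, max), kept sorted
        self.streams = {}

    def db_add(self, streamid, vector):
        """Insert a vector of (timestamp, seq, value, min, max) tuples."""
        points = self.streams.setdefault(streamid, [])
        points.extend(vector)
        points.sort(key=lambda p: p[0])

    def db_query(self, streamid, start, end):
        """Return all points with start <= timestamp < end."""
        points = self.streams.get(streamid, [])
        ts = [p[0] for p in points]
        return points[bisect.bisect_left(ts, start):bisect.bisect_left(ts, end)]

    def db_next(self, streamid, ref):
        """First point strictly after the reference time, or None."""
        points = self.streams.get(streamid, [])
        i = bisect.bisect_right([p[0] for p in points], ref)
        return points[i] if i < len(points) else None

    def db_prev(self, streamid, ref):
        """Last point strictly before the reference time, or None."""
        points = self.streams.get(streamid, [])
        i = bisect.bisect_left([p[0] for p in points], ref)
        return points[i - 1] if i > 0 else None
```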

Storage Manager: BDB Berkeley DB is an embedded key-value store. It stores binary blobs using B+ trees. Very mature – around since 1992; supports transactions, free-threading, and replication. We use version 4.

RPC Evolution First: shared memory – low latency. Then a move to threaded TCP with Google protocol buffers – zig-zag integer representation, multiple language bindings – extensible across multiple protocol versions.
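
The zig-zag representation mentioned above maps signed integers to unsigned ones so that values of small magnitude, positive or negative, encode into few varint bytes. A sketch of the 64-bit transform (assuming inputs fit in a signed 64-bit range):

```python
def zigzag_encode(n: int) -> int:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...: small magnitudes of either
    # sign become small unsigned values that need few varint bytes.
    # Python's >> is an arithmetic shift, so n >> 63 is 0 for
    # non-negative n and -1 (all ones) for negative n in 64-bit range.
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    # Invert: the low bit carries the sign, the remaining bits the magnitude.
    return (z >> 1) ^ -(z & 1)
```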

On-Disk Format All data stores perform poorly with one key per reading – the index size is high, and it is unnecessary. Solution: bucket readings under a (streamid, timestamp) key. This gives excellent locality of reference with B+ tree indexes – data is sorted by streamid and timestamp, so range queries translate into mostly large sequential IOs.
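
The bucketing idea can be sketched as follows. The 300-second bucket width and the plain dict standing in for the B+ tree are assumptions for illustration; the point is that many readings share one key, and a range query only touches the buckets overlapping the range – adjacent pages in a real B+ tree, hence sequential IO.

```python
BUCKET = 300  # hypothetical bucket width in seconds

store = {}  # (streamid, bucket_start) -> list of (timestamp, value)

def bucket_key(streamid, timestamp):
    # One key per bucket rather than per reading keeps the index small:
    # every reading in the same window shares a key.
    return (streamid, timestamp - timestamp % BUCKET)

def insert(streamid, timestamp, value):
    store.setdefault(bucket_key(streamid, timestamp), []).append((timestamp, value))

def query(streamid, start, end):
    # Scan only the buckets overlapping [start, end), filtering edges.
    out = []
    b = start - start % BUCKET
    while b < end:
        for ts, v in store.get((streamid, b), []):
            if start <= ts < end:
                out.append((ts, v))
        b += BUCKET
    return out
```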

On-Disk Format Represent in memory with a materialized structure – 32 bytes/rec – but this is inefficient on disk: lots of repeated data and missing fields. Solution: compression. First, delta-encode each bucket in a protocol buffer; second, apply Huffman or run-length encoding (zlib) before writing the BDB page. The combined compression is 2x better than gzip or either step alone, at 1M rec/second compress/decompress on modest hardware.
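
A rough illustration of why delta encoding plus a general-purpose compressor works well on regular time series. This uses zlib on the timestamp column only and a made-up record layout; it is not the readingdb wire format.

```python
import struct
import zlib

def compress_bucket(readings):
    # Delta-encode timestamps: regular sampling intervals turn into long
    # runs of the same small delta, which zlib then squeezes very hard.
    ts = [r[0] for r in readings]
    deltas = [ts[0]] + [b - a for a, b in zip(ts, ts[1:])]
    raw = struct.pack("<%di" % len(deltas), *deltas)
    return zlib.compress(raw)

def decompress_bucket(blob, n):
    # Invert: decompress, then prefix-sum the deltas back to timestamps.
    deltas = struct.unpack("<%di" % n, zlib.decompress(blob))
    ts, out = 0, []
    for d in deltas:
        ts += d
        out.append(ts)
    return out
```

On 100 readings sampled every 10 seconds, the compressed blob is far smaller than the 400 bytes needed to store the absolute 32-bit timestamps directly.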

Other Services: Storage Mapping What is in the database? Compute a set of tuples (start, end, n); the desired interpretation is "the data source was alive." Different data sources have different ways of maintaining this information, and different confidence in it – sometimes you have to infer it from the data; sometimes data sources give you liveness/presence guarantees ("I haven't heard from you in an hour, but I'm still alive!"). Dead or alive?
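
When liveness must be inferred from the data alone, one simple approach is to split the timestamp sequence wherever the gap between readings exceeds a threshold. A sketch, where the 120-second gap threshold is a made-up parameter:

```python
def storage_map(timestamps, max_gap=120):
    """Infer (start, end, n) liveness tuples from raw reading timestamps.

    A source is assumed dead whenever readings stop for more than
    max_gap seconds; n counts the readings in each live interval.
    """
    intervals = []
    for ts in sorted(timestamps):
        if intervals and ts - intervals[-1][1] <= max_gap:
            # Close enough to the previous reading: extend the interval.
            start, _, n = intervals[-1]
            intervals[-1] = (start, ts, n + 1)
        else:
            # Gap too large (or first reading): start a new interval.
            intervals.append((ts, ts, 1))
    return intervals
```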

readingdb6 Up since December, supporting Cory Hall, SDH Hall, and most other LoCal deployments – more than 2 billion points in 10k streams; 12 GB on disk ≈ 5 bytes/rec including the index – so we fit in memory! Imports run at around 300k points/sec – we maxed out the NIC.

Low Latency RPC

Compression ratios

Write load Importing old data: 150k points/sec. Continuous write load: pts/sec.

Future thoughts A component of a cloud storage stack for physical data. Hadoop adaptor: improve MapReduce performance over the HBase solution. The data is small – 2 billion points in 12 GB – so we can go a long time without distributing it much, though distribution is probably necessary for reasons other than performance.

THE END