Physical Data Storage Stephen Dawson-Haggerty. Data Sources sMAP - Data exploration/visualization - Control Loops - Demand response - Analytics - Mobile.

1 Physical Data Storage Stephen Dawson-Haggerty

2 [Diagram] Data Sources, sMAP, StreamFS, Hadoop HDFS
Applications
  – Data exploration/visualization
  – Control loops
  – Demand response
  – Analytics
  – Mobile feedback
  – Fault detection

3 Time-Series Databases
Expected workload
Related work
Server architecture
API
Performance
Future directions

4 Write Workload
sMAP Sources
  – HTTP/REST protocol for exposing physical information
  – Data trickles in as it's generated
  – Typical data rates: 1 reading every 1-60 s
Bulk imports
  – Existing databases
  – Migrations
[Figure: Dent circuit meter, sMAP]

5 Read Workload
Plotting engine
Matlab & Python adaptors for analysis
Mobile apps
Batch analysis
Dominated by range queries
Latency is important for interactive data exploration

6 [Architecture diagram; labels only] Page Cache, Lock Manager, Key-Value Store, Storage Alloc., Time-series Interface, Bucketing, RPC, Compression, readingdb, insert, resample, aggregate, query, streaming pipeline, SQL, Storage mapper, MySQL

7 Time series interface
db_open()
db_query(streamid, start, end)
  – Query points in a range
db_next(streamid, ref), db_prev(...)
  – Query points near a reference time
db_add(streamid, vector)
  – Insert points into the database
db_avail(streamid)
  – Retrieve storage map
db_close()
All data is part of a stream, identified only by streamid
A stream is a series of tuples: (timestamp, sequence, value, min, max)
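
The calls on this slide can be illustrated with a small in-memory stand-in. This is only a sketch, not readingdb's implementation: the class name `ToyReadingDB`, the half-open range `[start, end)`, and the strict before/after behavior of `db_next` and `db_prev` are all assumptions made for illustration.

```python
import bisect
from collections import defaultdict


class ToyReadingDB:
    """Hypothetical in-memory model of the slide's interface (not readingdb)."""

    def __init__(self):
        # streamid -> list of (timestamp, sequence, value, min, max),
        # kept sorted by timestamp.
        self.streams = defaultdict(list)

    def db_add(self, streamid, vector):
        """Insert a vector of tuples into one stream, keeping it sorted."""
        for rec in vector:
            bisect.insort(self.streams[streamid], rec)

    def db_query(self, streamid, start, end):
        """Return points with start <= timestamp < end (assumed semantics)."""
        recs = self.streams[streamid]
        lo = bisect.bisect_left(recs, (start,))
        hi = bisect.bisect_left(recs, (end,))
        return recs[lo:hi]

    def db_next(self, streamid, ref):
        """Return the first point strictly after ref, or None."""
        recs = self.streams[streamid]
        i = bisect.bisect_right(recs, (ref, float("inf")))
        return recs[i] if i < len(recs) else None

    def db_prev(self, streamid, ref):
        """Return the last point strictly before ref, or None."""
        recs = self.streams[streamid]
        i = bisect.bisect_left(recs, (ref,))
        return recs[i - 1] if i > 0 else None
```

Keeping each stream sorted by timestamp is what makes the range query a pair of binary searches plus one contiguous slice, which mirrors the range-query-dominated read workload described earlier.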

8 Storage Manager: BDB
Berkeley Database: embedded key-value store
Store binary blobs using B+ trees
Very mature: around since 1992, supports transactions, free-threading, replication
We use version 4

9 RPC Evolution
First: shared memory
  – Low latency
Move to threaded TCP
Google Protocol Buffers
  – zig-zag integer representation, multiple language bindings
  – Extensible for multiple versions
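
The zig-zag representation mentioned above maps signed integers onto unsigned ones so that values of small magnitude, positive or negative, stay short when written as variable-length integers. The following sketch shows the standard 64-bit mapping and a base-128 varint writer; the function names are illustrative and are not taken from readingdb or from the protocol buffer library.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to unsigned: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 63)) & MASK64


def zigzag_decode(z: int) -> int:
    """Invert zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def varint(z: int) -> bytes:
    """Encode an unsigned integer as a base-128 varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


# A small negative delta costs one byte with zig-zag, but ten bytes when
# its 64-bit two's-complement form is written directly as a varint.
short = varint(zigzag_encode(-1))
long_form = varint(-1 & MASK64)
```

This matters for time-series data because deltas between neighboring readings are often small and may be negative.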

10 On-Disk Format
All data stores perform poorly with one key per reading
  – index size is high
  – unnecessary
Solution: bucket readings
Excellent locality of reference with B+ tree indexes
  – Data sorted by streamid and timestamp
  – Range queries translate into mostly large sequential IOs
[Figure: bucket keyed by (streamid, timestamp)]
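
A minimal sketch of the bucketing idea follows. The slide gives no bucket width, so `BUCKET_SECONDS` is an assumed value, and the helper names are hypothetical. Each key is a `(streamid, bucket start)` pair, so many readings share one key and a range query touches a short run of adjacent keys.

```python
from collections import defaultdict

BUCKET_SECONDS = 3600  # assumed bucket width; not stated in the slides


def bucket_key(streamid, timestamp):
    """Key for the bucket holding this reading: (streamid, bucket start)."""
    return (streamid, timestamp - timestamp % BUCKET_SECONDS)


def bucketize(streamid, readings):
    """Group (timestamp, value) readings into buckets keyed by bucket_key."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[bucket_key(streamid, ts)].append((ts, value))
    return dict(buckets)


def keys_for_range(streamid, start, end):
    """Bucket keys a query over [start, end) has to read, in key order."""
    first = bucket_key(streamid, start)[1]
    return [(streamid, t) for t in range(first, end, BUCKET_SECONDS)]


# Two hours of 10-second readings collapse from 720 keys to 2.
buckets = bucketize(7, [(t, 1.0) for t in range(0, 7200, 10)])
```

Because the keys sort by streamid first and bucket start second, the keys returned by `keys_for_range` are neighbors in a sorted index, which is the locality property the slide relies on.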

11 On-Disk Format
Represent in memory with materialized structure
  – 32 bytes/rec
  – Inefficient on disk: lots of repeated data, missing fields
Solution: compression
  – First: delta encode each bucket in protocol buffer
  – Second: Huffman Tree or Run Length encoding (zlib)
Combined compression is 2x better than gzip or either step alone
1M rec/second compress/decompress on modest hardware
[Figure: compress, bdb page...]
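
The two-stage idea can be sketched as below. This is an assumption-laden illustration rather than readingdb's format: fixed-width `struct` packing stands in for the protocol buffer encoding, only timestamps are encoded, and the sample data (one reading every 10 seconds) is invented.

```python
import struct
import zlib


def delta_encode(values):
    """Replace each value by its difference from the previous one."""
    prev, out = 0, []
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out


def pack(ints):
    """Pack integers as little-endian signed 64-bit values (stand-in format)."""
    return struct.pack(f"<{len(ints)}q", *ints)


def unpack(blob):
    """Invert pack."""
    return list(struct.unpack(f"<{len(blob) // 8}q", blob))


# Invented sample bucket: 1000 timestamps, 10 seconds apart.
timestamps = [1300000000 + 10 * i for i in range(1000)]
plain = zlib.compress(pack(timestamps))
combined = zlib.compress(pack(delta_encode(timestamps)))
restored = delta_decode(unpack(zlib.decompress(combined)))
```

Delta encoding turns a regularly sampled stream into a long run of identical small numbers, which the general-purpose compressor then shrinks far more than it could shrink the raw timestamps.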

12 Other Services: Storage Mapping
What is in the database?
  – Compute a set of tuples (start, end, n)
The desired interpretation is “the data source was alive”
Different data sources have different ways of maintaining this information and maintaining confidence
  – Sometimes you have to infer it from the data
  – Sometimes data sources give you liveness/presence guarantees
  – “I haven’t heard from you in an hour, but I’m still alive!”
[Figure: dead or alive?]
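
For the case where liveness has to be inferred from the data, one possible approach is to split a stream wherever consecutive readings are further apart than a threshold. The sketch below assumes that rule; the function name and the `max_gap` parameter are hypothetical and the slides do not say how readingdb computes its map.

```python
def storage_map(timestamps, max_gap):
    """Collapse sorted timestamps into (start, end, n) runs.

    A new run starts whenever the gap to the previous reading exceeds
    max_gap, which is treated here as "the source was not alive".
    """
    runs = []
    for ts in timestamps:
        if runs and ts - runs[-1][1] <= max_gap:
            start, _, n = runs[-1]
            runs[-1] = (start, ts, n + 1)  # extend the current run
        else:
            runs.append((ts, ts, 1))  # open a new run
    return runs


# Readings every 10 s, then silence, then two more readings.
example = storage_map([0, 10, 20, 500, 510], max_gap=60)
```

A source that sends explicit heartbeats would not need this inference; the threshold only approximates liveness when the data is all there is.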

13 readingdb6
Up since December supporting Cory Hall, SDH Hall, most other LoCal Deployments
  – behind www.openbms.org
> 2 billion points in 10k streams
  – 12GB on disk ~= 5 bytes/rec including index
  – So... we fit in memory!
Import at around 300k points/sec
  – We maxed out the NIC

14 Low Latency RPC

15 Compression ratios

16 Write load
Importing old data: 150k points/sec
Continuous write load: 300-500 pts/sec

17 Future thoughts
A component of a cloud storage stack for physical data
Hadoop adaptor: improve MapReduce performance over HBase solution
The data is small: 2 billion points in 12GB
  – We can go a long time without distributing this very much
  – Distribution is probably necessary for reasons other than performance

18 THE END

