What to do with Scientific Data? by Michael Stonebraker.

Slides:

Advertisements

Similar presentations

What to do with Scientific Data? Michael Stonebraker.

Advertisements

An Array-Based Algorithm for Simultaneous Multidimensional Aggregates By Yihong Zhao, Prasad M. Desphande and Jeffrey F. Naughton Presented by Kia Hall.

Materialization and Cubing Algorithms. Cube Materialization Each cell of the data cube is a view consisting of an aggregation of interest. The values.

1 Chapter 5 : Query Processing and Optimization Group 4: Nipun Garg, Surabhi Mithal

Kien A. Hua Division of Computer Science University of Central Florida.

HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.

7 +/- 2 Maybe Good Ideas John Caron June (1) NetCDF-Java (aka CDM) has lots of functionality, but only available in Java – NcML Aggregation – Access.

Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.

Discretized Streams: Fault-Tolerant Streaming Computation at Scale Wenting Wang 1.

SciDB An Open Source Data Base Project by Michael Stonebraker (and others) 1.

1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.

© Copyright 2011 John Wiley & Sons, Inc.

Multiple-key indexes Index on one attribute provides pointer to an index on the other. If V is a value of the first attribute, then the index we reach.

Data Warehousing - 3 ISYS 650. Snowflake Schema one or more dimension tables do not join directly to the fact table but must join through other dimension.

1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.

Panel Summary Andrew Hanushevsky Stanford Linear Accelerator Center Stanford University XLDB 23-October-07.

Chapter 8 Physical Database Design. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Outline Overview of Physical Database.

An Array-Based Algorithm for Simultaneous Multidimensional Aggregates

Jeremy Boyd Director – Mindscape MSDN Regional Director

Rebecca Boger Earth and Environmental Sciences Brooklyn College.

Database System Concepts and Architecture Lecture # 3 22 June 2012 National University of Computer and Emerging Sciences.

Committed to Deliver….  We are Leaders in Hadoop Ecosystem.  We support, maintain, monitor and provide services over Hadoop whether you run apache Hadoop,

Systems analysis and design, 6th edition Dennis, wixom, and roth

MapReduce VS Parallel DBMSs

Spatial Data Management Chapter 28. Types of Spatial Data Point Data –Points in a multidimensional space E.g., Raster data such as satellite imagery,

H ADOOP DB: A N A RCHITECTURAL H YBRID OF M AP R EDUCE AND DBMS T ECHNOLOGIES FOR A NALYTICAL W ORKLOADS By: Muhammad Mudassar MS-IT-8 1.

 DATABASE DATABASE  DATABASE ENVIRONMENT DATABASE ENVIRONMENT  WHY STUDY DATABASE WHY STUDY DATABASE  DBMS & ITS FUNCTIONS DBMS & ITS FUNCTIONS 

Faculty of Applied Engineering and Urban Planning Civil Engineering Department Geographic Information Systems Vector and Raster Data Models Lecture 3 Week.

A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.

Oracle Challenges Parallelism Limitations Parallelism is the ability for a single query to be run across multiple processors or servers. Large queries.

Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.

Ohio State University Department of Computer Science and Engineering 1 Supporting SQL-3 Aggregations on Grid-based Data Repositories Li Weng, Gagan Agrawal,

Partitioning – A Uniform Model for Data Mining Anne Denton, Qin Ding, William Jockheck, Qiang Ding and William Perrizo.

MonetDB/X100 hyper-pipelining query execution Peter Boncz, Marcin Zukowski, Niels Nes.

MWA Data Capture and Archiving Dave Pallot MWA Conference Melbourne Australia 7 th December 2011.

“One Size Fits All” An Idea Whose Time Has Come and Gone by Michael Stonebraker.

Quantitative Analysis. Quantitative / Formal Methods objective measurement systems graphical methods statistical procedures.

Querying Large Databases Rukmini Kaushik. Purpose Research for efficient algorithms and software architectures of query engines.

AZR308. Building distributed systems on an abstraction against commodity hardware at Internet scale, composed of multiple services. Distributed System.

ICPP 2012 Indexing and Parallel Query Processing Support for Visualizing Climate Datasets Yu Su*, Gagan Agrawal*, Jonathan Woodring † *The Ohio State University.

1 CS 430 Database Theory Winter 2005 Lecture 16: Inside a DBMS.

Chapter 18 Object Database Management Systems. McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved. Outline Motivation for object.

Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!

Real-time Graphics for VR Chapter 23. What is it about? In this part of the course we will look at how to render images given the constrains of VR: –we.

Prof. Bayer, DWH, Ch.5, SS Chapter 5. Indexing for DWH D1Facts D2.

SAGA: Array Storage as a DB with Support for Structural Aggregations SSDBM 2014 June 30 th, Aalborg, Denmark 1 Yi Wang, Arnab Nandi, Gagan Agrawal The.

Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.

MapReduce and Data Management Based on slides from Jimmy Lin’s lecture slides ( (licensed.

Client-Server Paradise ICOM 8015 Distributed Databases.

Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.

Chapter 8 Physical Database Design. Outline Overview of Physical Database Design Inputs of Physical Database Design File Structures Query Optimization.

David Malon*, Peter van Gemmeren, Jack Weinstein Argonne National Laboratory ACAT September 2011 An Exploration of SciDB in the.

Big traffic data processing framework for intelligent monitoring and recording systems 學生 : 賴弘偉教授 : 許毅然作者 : Yingjie Xia a, JinlongChen a,b,n, XindaiLu.

Efficient OLAP Operations in Spatial Data Warehouses Dimitris Papadias, Panos Kalnis, Jun Zhang and Yufei Tao Department of Computer Science Hong Kong.

Indexing OLAP Data Sunita Sarawagi Monowar Hossain York University.

Chapter 18 Object Database Management Systems. Outline Motivation for object database management Object-oriented principles Architectures for object database.

Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy.

1 HBASE – THE SCALABLE DATA STORE An Introduction to HBase XLDB Europe Workshop 2013: CERN, Geneva James Kinley EMEA Solutions Architect, Cloudera.

® Sponsored by Improving Access to Point Cloud Data 98th OGC Technical Committee Washington DC, USA 8 March 2016 Keith Ryden Esri Software Development.

Elasticity in SciDB DBMS Team Members Gunjan Sharma(MT15015) Hiya popli(MT15020)

Spatial Data Management

Evaluation of Relational Operations: Other Operations

April 30th – Scheduling / parallel

CSCI1600: Embedded and Real Time Software

Overview of big data tools

Storage Structure and Efficient File Access

Evaluation of Relational Operations: Other Techniques

CSCI1600: Embedded and Real Time Software

What to do with Scientific Data? by Michael Stonebraker

SciDB An Open Source Data Base Project by Michael Stonebraker (and others) 1.

Presentation transcript:

What to do with Scientific Data? by Michael Stonebraker

Outline Outline  Science data – what it looks like  Hardware options for deployment  Software options  RDBMS  Wrappers on RDBMS  SciDB

O(100) petabytes

LSST Data LSST Data  Raw imagery  2-D arrays of telescope readings  “Cooked” into observations  Image intensity algorithm (data clustering)  Spatial data  Further cooked into “trajectories”  Similarity query  Constrained by maximum distance

Example LSST Queries Example LSST Queries  Recook raw imagery with my algorithm  Find all observations in a spatial region  Find all trajectories that intersect a cylinder in time

Snow Cover in the Sierras Snow Cover in the Sierras

Satellite Imagery Satellite Imagery  Raw data  Array of pixels precessing around the earth  Spherical co-ordinates  Cooked into images  Typically “best” pixel over a time window  i.e. image is a composite of several passes  Further cooked into various other things  E.g. polygons of constant snow cover

Example Queries Example Queries  Recook raw data  Using a different composition algorithm  Retrieve cooked imagery in a time cylinder  Retrieve imagery which is changing at a large rate

Chemical Plant Data Chemical Plant Data  Plant is a directed graph of plumbing  Sensors at various places (1/sec observations)  Directed graph of time series  To optimize output plant runs “near the edge”  And fails every once in a while – down for a week

Chemical Plant Data Chemical Plant Data  Record all data {(time, sensor-1, … sensor-5000)}  Look for “interesting events – i.e. sensor values out of whack”  Cluster events near each other in 5000 dimension space  Idea is to identify “near-failure modes”

General Model General Model sensors Cooking Algorithm(s) (pipeline) Derived data

Traditional Wisdom Traditional Wisdom  Cooking pipeline outside DBMS  Derived data loaded into DBMS for subsequent querying

Problems with This Approach Problems with This Approach  Easy to lose track of the raw data  Cannot query the raw data  Recooking is painful in application logic – might be easier in a DBMS (stay tuned)  Provenance (meta data about the data) is often not captured  E.g. cooking parameters  E.g. sensor calibration

My preference My preference  Load the raw data into a DBMS  Cooking pipeline is a collection of user-defined functions (DBMS extensions)  Activated by triggers or a workflow management system  ALL data captured in a common system!!!

Deployment Options Deployment Options  Supercomputer/mainframe  Individual project “silos”  Internal grid (cloud behind the firewall)  External cloud (e.g. Amazon EC20

Deployment Options Deployment Options  Supercomputer/main frame  ($$$$)  Individual project “silos”  Probably what you do now….  Every silo has a system administrator and a DBA (expensive)  Generally results in poor sharing of data

Deployment Options Deployment Options  Internal grid (cloud behind the firewall)  Mimic what Google/Amazon/Yahoo/et.al do  Other report huge savings in DBA/SE costs  Does not require you buy VMware  Requires a software stack that can enforce service guarantees

Deployment Options Deployment Options  External cloud (e.g. EC2)  Amazon can “stand up” a node wildly cheaper than Exxon – economies of scale from 10K nodes to 500K nodes  Security/company policy issues will be an issue  Amazon pricing will be an issue  Likely to be the cheapest in the long run

What DBMS to Use? What DBMS to Use?  RDBMS (e.g. Oracle)  Pretty hopeless on raw data  Simulating arrays on top of tables likely to cost a factor of  Not pretty on time series data  Find me a sensor reading whose average value over the last 3 days is within 1% of the average value over the adjoining 5 sensors

What DBMS to Use? What DBMS to Use?  RDBMS (e.g. Oracle)  Spatial data may (or may not) be ok  Cylinder queries will probably not work well  2-D rectangular regions will probably be ok  Look carefully at spatial indexing support (usually R-trees)

RDBMS Summary RDBMS Summary  Wrong data model  Arrays not tables  Wrong operations  Regrid not join  Missing features  Versions, no-overwrite, provenance, support for uncertain data, …

But your mileage may vary…… But your mileage may vary……  SQLServer working well for Sloan Skyserver data base  See paper in CIDR 2009 by Jose Blakeley

How to Do Analytics (e.g.clustering) How to Do Analytics (e.g.clustering)  Suck out the data  Convert to array format  Pass to MatLab, R, SAS, …  Compute  Return answer to DBMS

Bad News Bad News  Painful  Slow  Many analysis platforms are main memory only

RDBMS Summary RDBMS Summary  Issues not likely to get fixed any time soon  Science is small compared to business data processing

Wrapper on Top of RDBMS -- MonetDB Wrapper on Top of RDBMS -- MonetDB  Arrays simulated on top of tables  Layer above RDBMS will replace SQL with something friendlier to science  But will not fix performance problems!! Bandaid solution……

RasDaMan Solution RasDaMan Solution  An array is a blob  or array is cut into chunks and stored as a collection of blobs  Array DBMS is in user-code outside DBMS  Uses RDBMS as a reliable (but slow) file system  Grid support looks especially slow

My Proposal -- SciDB My Proposal -- SciDB  Build a commercial-quality array DBMS from the ground up.

SciDB Data Model SciDB Data Model  Nested multidimensional arrays  Augmented with co-ordinate systems (floating point dimensions)  Ragged arrays  Array values are a tuple of values and arrays

Data Storage Optimized for both dense and sparse array data  Different data storage, compression, and access Arrays are “chunked” (in multiple dimensions) Chunks are partitioned across a collection of nodes Chunks have ‘overlap’ to support neighborhood operations Replication provides efficiency and back-up Fast access to data sliced along any dimension  Without materialized views

SciDB DDL CREATE ARRAY Test_Array < A: integer NULLS, B: double, C: USER_DEFINED_TYPE > [I=0:99999,1000, 10, J=0:99999,1000, 10 ] PARTITION OVER ( Node1, Node2, Node3 ) USING block_cyclic(); chunk size 1000 overlap 10 attribute names A, B, C index names I, J

Array Query Language (AQL) Array Query Language (AQL)  Array data management (e.g. filter, aggregate, join, etc.)  Stat operations (multiply, QR factor, etc.)  Parallel, disk-oriented  User-defined operators (Postgres-style)  Interface to external stat packages (e.g. R)

Array Query Language (AQL) SELECT Geo-Mean ( T.B ) FROM Test_Array T WHERE T.I BETWEEN :C1 AND :C2 AND T.J BETWEEN :C3 AND :C4 AND T.A = 10 GROUP BY T.I; User-defined aggregate on an attribute B in array T Subsample Filter Group-by So far as SELECT / FROM / WHERE / GROUP BY queries are concerned, there is little logical difference between AQL and SQL

Matrix Multiply Smaller of the two arrays is replicated at all nodes  Scatter-gather Each node does its “core” of the bigger array with the replicated smaller one Produces a distributed answer CREATE ARRAY TS_Data [ I=0:99999,1000,0, J=0:3999,100,0 ] Select multiply (TS_data.A1, test_array.B)

Architecture Shared nothing cluster  10’s–1000’s of nodes  Commodity hardware  TCP/IP between nodes  Linear scale-up Each node has a processor and storage Queries refer to arrays as if not distributed Query planner optimizes queries for efficient data access & processing Query plan runs on a node’s local executor&storage manager Runtime supervisor coordinates execution Doesn’t require JDBC, ODBC AQL an extension of SQL Also supports UDFs Java, C++, whatever…

Other Features Which Science Guys Want (These could be in RDBMS, but Aren’t)  Uncertainty  Data has error bars  Which must be carried along in the computation (interval arithmetic)

Other Features Other Features  Time travel  Don’t fix errors by overwrite  I.e. keep all of the data  Named versions  Recalibration usually handled this way

Other Features Other Features  Provenance (lineage)  What calibration generated the data  What was the “cooking” algorithm  In general – repeatability of data derivation