SciDB An Open Source Data Base Project by Michael Stonebraker (and others)

1 SciDB An Open Source Data Base Project by Michael Stonebraker (and others)

2 Outline
- Why science folks are unhappy with RDBMS
- How we plan to fix that
- The details

3 Why SciDB?
- “Big science” is very unhappy with RDBMS
  - Astronomy
  - HEP (high-energy physics)
  - Fusion
  - Bio
  - Remote sensing

4 Why?
- Experience of Sequoia 2000 (mid 1990s)
  - Tried to use Postgres for science databases
  - Failed badly…
- Main science data type is an array; simulating arrays on top of tables is horribly inefficient
- Required features absent (provenance, uncertainty, version control)
- SQL operations are the wrong ones (science needs regrid, not join)

5 Why SciDB?
- Net result: mentality of “roll your own from the ground up” for every new science project
- Realization by the science community that this is long-term suicide
- Community wants to get behind something better
- Great commonality of needs among domains

6 A Little Context
- XLDB-1
  - Genesis of the need
- Asilomar conference (March 2008)
  - Small conference to generate requirements

7 A Little Context
- March 2008 to September 2008
  - Initial design completed
  - Fund raising
  - Recruiting of initial team
  - Detailed use cases specified

8 Our Partnership
- Science and high-end commercial folks
  - Who will put up some resources
  - And review design
- DBMS brain trust
  - Who will design the system, oversee its construction, and perform needed research
- Non-profit company
  - Which will manage the open source project
  - And support the resulting system
  - May need long term funding help

9 Partners -- Science (We are recruiting more…)
- LSST astronomy project
  - DBMS work co-ordinated by SLAC
- Pacific Northwest National Laboratory (PNNL)
  - Various bio projects
- Lawrence Livermore National Laboratory
  - Fusion projects
- UCSB
  - Remote sensing

10 Partners -- DBMS
- Mike Stonebraker (MIT)
- Dave DeWitt (Wisconsin -> Microsoft)
- Jignesh Patel (Wisconsin)
- Jennifer Widom (Stanford)
- Dave Maier (Portland State)
- Stan Zdonik (Brown)
- Sam Madden (MIT)
- Ugur Cetintemel (Brown)
- Magda Balazinska (Washington)
- Mike Carey (UCI)

11 Partners -- Other
- eBay
- Vertica
- Microsoft
- LSST
- SLAC
- Will hit up NSF and DOE

12 The SciDB Data Model
- Nothing (e.g. Hadoop, Pig, Hive, …)?
  - Most of you have schemas
  - Hadoop is not a good starting point
    - Slow
    - No HA (high availability)

13 The SciDB Data Model
- Tables?
  - Makes a few of you happy
  - Used by Sloan Sky Survey
  - But Pan-STARRS (Alex Szalay) wants arrays and scalability

14 The SciDB Data Model
- Arrays?
  - Superset of tables (a table with a primary key is a 1-D array)
  - Makes HEP, remote sensing, astronomy, oceanography folks happy
  - But not biology and chemistry (who want networks and sequences)
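The "superset of tables" point can be illustrated with a minimal Python sketch. This is not SciDB code: the `rows` table, its column names, and the dense integer primary key `star_id` are all made up for the example.

```python
# Hypothetical table: rows keyed by a dense integer primary key (star_id).
rows = [
    {"star_id": 0, "ra": 10.5, "dec": -3.2},
    {"star_id": 1, "ra": 11.0, "dec": 4.7},
    {"star_id": 2, "ra": 12.3, "dec": 0.1},
]

# The same data viewed as a 1-D array: the primary key becomes the
# dimension (the index) and the remaining columns become the cell's tuple.
array_1d = [None] * len(rows)
for row in rows:
    array_1d[row["star_id"]] = (row["ra"], row["dec"])

# A primary-key lookup on the table is a plain index operation on the array.
print(array_1d[1])  # (11.0, 4.7)
```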

15 The SciDB Data Model
- Multidimensional grids
  - Superset of arrays (non-uniform cells)
  - Makes solid modeling folks happy
  - But complex and slow

16 SciDB Data Model
- Nested multidimensional arrays
  - Array values are a tuple of values and arrays
  - Sightings (sid, details) [x, y, z, t]
  - Objects (type, [sid]) [id]
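One way to read the two declarations above is sketched below in Python. The slides give only the shapes of `Sightings` and `Objects`, so the field types, the sample values, the dictionary used as a sparse 4-D array, and the helper `cells_for_object` are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Sighting:
    sid: int
    details: str  # assumed to be free text; the slides do not say


@dataclass
class SkyObject:
    type: str
    sids: list = field(default_factory=list)  # the nested 1-D array [sid]


# Sightings (sid, details) [x, y, z, t]: a dict keyed by (x, y, z, t)
# stands in for a sparse 4-D array whose cells are tuples.
sightings = {
    (3, 4, 0, 1): Sighting(sid=100, details="faint"),
    (3, 5, 0, 2): Sighting(sid=101, details="bright"),
}

# Objects (type, [sid]) [id]: a 1-D array indexed by id, where each cell
# holds a scalar and a nested array.
objects = [SkyObject(type="star", sids=[100, 101])]


def cells_for_object(obj_id):
    """Return the (x, y, z, t) cells of every sighting of one object."""
    wanted = set(objects[obj_id].sids)
    return sorted(cell for cell, s in sightings.items() if s.sid in wanted)
```

Calling `cells_for_object(0)` follows the nested `[sid]` array from an object back to the cells where it was sighted.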

17 Basic Arrays
- Positive integer dimensions, no gaps
- Bounded or unbounded

18 Enhanced Arrays
- “Shape” function
- Supports irregular boundary
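A shape function can be pictured as a predicate that says which cells exist. The sketch below is a guess at the idea, not the SciDB design: `ShapedArray` and `triangular_shape` are hypothetical names, and the triangular boundary is chosen only because it is easy to check by hand.

```python
def triangular_shape(i, j):
    """Hypothetical shape function: cell (i, j) exists only when 1 <= j <= i."""
    return 1 <= j <= i


class ShapedArray:
    """2-D array whose valid cells are decided by a user-supplied shape function."""

    def __init__(self, shape_fn):
        self.shape_fn = shape_fn
        self.cells = {}

    def _check(self, idx):
        if not self.shape_fn(*idx):
            raise IndexError(f"{idx} is outside the array's shape")

    def __setitem__(self, idx, value):
        self._check(idx)
        self.cells[idx] = value

    def __getitem__(self, idx):
        self._check(idx)
        return self.cells.get(idx)  # None for a valid but unset cell


a = ShapedArray(triangular_shape)
a[3, 2] = 7.5      # inside the triangle, accepted
# a[2, 3] = 1.0    # would raise IndexError: outside the irregular boundary
```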

19 Enhanced Arrays
- Co-ordinate systems
  - User defined functions that map integers to something else
  - E.g. Mercator
  - Use dimension notation to access, e.g. A[17,36] or A{468.2, 917.6}
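The two access notations can be mimicked in Python as follows. Everything here is an assumption for illustration: `CoordArray` is a hypothetical class, `at(...)` stands in for the brace notation (which Python lacks), and the mapping is a plain linear scale of 0.5 units per cell rather than a real Mercator projection.

```python
class CoordArray:
    """Array addressable by integer index or by user-defined coordinates."""

    def __init__(self, to_coord, to_index):
        self.to_coord = to_coord  # integer index -> coordinate
        self.to_index = to_index  # coordinate -> integer index (the inverse)
        self.cells = {}

    def __setitem__(self, idx, value):
        self.cells[idx] = value

    def __getitem__(self, idx):  # integer access, like A[17, 36]
        return self.cells.get(idx)

    def at(self, *coords):  # coordinate access, like A{...}
        idx = tuple(self.to_index(c) for c in coords)
        return self.cells.get(idx)


SCALE = 0.5  # assumed: each integer step covers 0.5 coordinate units

A = CoordArray(to_coord=lambda i: i * SCALE,
               to_index=lambda c: round(c / SCALE))
A[17, 36] = 42
print(A.at(8.5, 18.0))  # 42, the same cell as A[17, 36]
```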

20 SciDB Query Language
- “Parse-tree” representation of array operations
- With a “binding” to:
  - MATLAB
  - C++
  - Python
  - IDL
  - There may be more…
- User extendable operations (Postgres-style)
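A parse-tree representation with a language binding might look roughly like the sketch below. The node classes `Scan`, `Filter`, and `Apply`, the `ARRAYS` catalog, and `evaluate` are hypothetical. The slides do not specify the actual node types or binding API, and arrays are reduced to 1-D lists to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable

ARRAYS = {"A": [1, 5, 8, 12]}  # toy catalog of named 1-D arrays


@dataclass
class Scan:
    name: str


@dataclass
class Filter:
    child: object
    predicate: Callable


@dataclass
class Apply:
    child: object
    fn: Callable


def evaluate(node):
    """Walk the parse tree bottom-up and return the resulting values."""
    if isinstance(node, Scan):
        return list(ARRAYS[node.name])
    if isinstance(node, Filter):
        return [v for v in evaluate(node.child) if node.predicate(v)]
    if isinstance(node, Apply):
        return [node.fn(v) for v in evaluate(node.child)]
    raise TypeError(f"unknown node: {node!r}")


# The host language builds the tree; the engine is free to inspect or
# rewrite it before running anything.
tree = Apply(Filter(Scan("A"), lambda v: v > 4), lambda v: v * 2)
print(evaluate(tree))  # [10, 16, 24]
```

The point of the design is that each binding only has to construct trees, so the same engine can serve several host languages.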

21 Operations
- Standard relational ones (filter, join)
- Plus whatever you want (regrid, interpolate, Fourier transform, eigenvalues, …)
- Plus add your own (Postgres-style)
- We need science input here!!!
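As a rough illustration of a science operation plus user extension, the sketch below registers a `regrid` function in a small operation table. The slides do not define regrid, so it is assumed here to mean coarsening a 2-D grid by averaging non-overlapping blocks. The `OPERATIONS` registry and the `operation` decorator are hypothetical.

```python
OPERATIONS = {}  # name -> function; stands in for an extensible operator catalog


def operation(name):
    """Decorator that registers a user-defined operation under a name."""
    def register(fn):
        OPERATIONS[name] = fn
        return fn
    return register


@operation("regrid")
def regrid(grid, factor):
    """Coarsen a 2-D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [grid[i][j]
                     for i in range(r, min(r + factor, rows))
                     for j in range(c, min(c + factor, cols))]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
coarse = OPERATIONS["regrid"](grid, 2)
print(coarse)  # [[3.5, 5.5], [11.5, 13.5]]
```

Expressing the same block average over a table of (row, column, value) tuples would take a group-by on computed keys, which is the kind of mismatch the slides complain about.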

22 Environment and Storage
- Extendable grid (cloud) of Linux machines
- With built-in high availability and failover
- And built-in disaster recovery

23 In Situ Processing
- Operate on data without loading it
- Supported by a SciDB self-describing file format
- And some number of adaptors, e.g. HDF5, NetCDF
- Or write your own

24 Storage Model
- Arrays are “chunked” in storage
- Chunk size can vary
- Chunks are partitioned across the grid
- Go for scalability to petabytes
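Chunking and partitioning can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the SciDB storage layout: the chunk shape is fixed (the slides say it can vary), placement uses a simple arithmetic hash, and `chunk_id` and `node_for_chunk` are hypothetical helper names.

```python
def chunk_id(index, chunk_shape):
    """Map a cell index to the id of the chunk that contains it."""
    return tuple(i // c for i, c in zip(index, chunk_shape))


def node_for_chunk(cid, num_nodes):
    """Assign a chunk to a node with a small deterministic hash."""
    h = 0
    for part in cid:
        h = h * 31 + part
    return h % num_nodes


CHUNK_SHAPE = (10, 10)  # assumed: 10 x 10 cells per chunk
NUM_NODES = 4           # assumed grid size

cid = chunk_id((17, 36), CHUNK_SHAPE)
print(cid, node_for_chunk(cid, NUM_NODES))  # (1, 3) 2
```

Cells that are close together land in the same chunk, so a query over a small region touches few chunks and therefore few nodes.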

25 Other Features Which Science Guys Want (These could be in RDBMS, but Aren’t)
- Uncertainty
  - Data has error bars
  - Which must be carried along in the computation (interval arithmetic)
  - Will look at more sophisticated error models later
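Carrying error bars through a computation with interval arithmetic can be sketched as below. The `Interval` class and its `from_error` constructor are hypothetical names. Only addition, subtraction, and multiplication are shown, using the standard interval rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    @classmethod
    def from_error(cls, value, err):
        """Build the interval [value - err, value + err] from an error bar."""
        return cls(value - err, value + err)

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # Smallest result: our low minus their high, and vice versa.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # The extremes can come from any pairing of endpoints.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))


a = Interval.from_error(10.0, 0.5)   # 10.0 +/- 0.5
b = Interval.from_error(2.0, 0.25)   # 2.0 +/- 0.25
print(a + b)  # Interval(lo=11.25, hi=12.75)
print(a * b)  # Interval(lo=16.625, hi=23.625)
```

Interval bounds are guaranteed but pessimistic, which is presumably why the slide defers more sophisticated error models to later work.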

26 Other Features
- Provenance (lineage)
  - What calibration generated the data
  - What was the “cooking” algorithm
  - In general: repeatability of data derivation
- Supported by a command log with query facilities (interesting research problem)
- And redo
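A command log that can be queried and replayed might be sketched as follows. `CommandLog`, its methods, and the two toy processing steps are hypothetical. The sketch assumes each derived array is written once and keeps everything in memory.

```python
class CommandLog:
    """Records every derivation so lineage can be queried and re-run."""

    def __init__(self):
        self.entries = []  # (output name, operation name, function, input names)

    def run(self, store, output, op_name, fn, inputs):
        """Execute fn over the named inputs, store the result, and log it."""
        store[output] = fn(*(store[name] for name in inputs))
        self.entries.append((output, op_name, fn, tuple(inputs)))

    def lineage(self, name):
        """Return the operations that produced `name`, oldest first."""
        steps = []
        for output, op_name, _fn, inputs in self.entries:
            if output == name:
                for parent in inputs:
                    steps.extend(self.lineage(parent))
                steps.append(op_name)
        return steps

    def redo(self, raw):
        """Replay the whole log against a fresh copy of the raw data."""
        store = dict(raw)
        for output, _op_name, fn, inputs in self.entries:
            store[output] = fn(*(store[name] for name in inputs))
        return store


log = CommandLog()
store = {"raw": [1.0, 2.0, 3.0]}
log.run(store, "calibrated", "calibrate", lambda xs: [x * 2.0 for x in xs], ["raw"])
log.run(store, "cooked", "cook", lambda xs: [x + 1.0 for x in xs], ["calibrated"])

print(log.lineage("cooked"))               # ['calibrate', 'cook']
print(log.redo({"raw": [10.0]})["cooked"])  # [21.0]
```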

27 Other Features
- Time travel
  - Don’t fix errors by overwrite
  - I.e. keep all of the data
  - Supported by an extra array dimension (history)
- Spatial support
- Named versions
  - Recalibration usually handled this way
  - Supported by allocating an array for the new version and “diffing” against its parent
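Both ideas can be sketched in Python under simplifying assumptions. `HistoryArray`, `diff`, and `apply_diff` are hypothetical names, arrays are 1-D lists, and the named version is stored as a dictionary of changed cells, which is only one possible reading of "diffing against its parent".

```python
class HistoryArray:
    """1-D array with an extra, append-only history dimension."""

    def __init__(self, values):
        self.history = [list(values)]  # history[t][i]

    def update(self, index, value):
        """Record a correction as a new time slice; nothing is overwritten."""
        new = list(self.history[-1])
        new[index] = value
        self.history.append(new)

    def as_of(self, t):
        """Time travel: return the array as it stood at history step t."""
        return self.history[t]


def diff(parent, child):
    """Keep only the cells of a named version that differ from its parent."""
    return {i: c for i, (p, c) in enumerate(zip(parent, child)) if p != c}


def apply_diff(parent, delta):
    """Rebuild the named version from its parent plus the stored delta."""
    return [delta.get(i, p) for i, p in enumerate(parent)]


arr = HistoryArray([1, 2, 3])
arr.update(1, 20)
print(arr.as_of(0), arr.as_of(1))  # [1, 2, 3] [1, 20, 3]

parent = [1, 2, 3]
recalibrated = [1.1, 2, 3.3]       # made-up recalibration result
delta = diff(parent, recalibrated)
print(delta)                       # {0: 1.1, 2: 3.3}
```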

28 Other Features
- (Optionally) integration of the real time data capture system
  - “Cooking” inside DBMS
  - Makes provenance capture easier
  - Sometimes important

29 Time Line
- Q4/08: start company, begin research activities
- Late 2009: demoware available
- Late 2010: V1 ships

30 Project Organization (Build-it for real)
- CEO (Andy Palmer -- Vertica)
- Project management (Bobbi Heath -- Vertica)
- CTO (Stonebraker)

31 Project Organization (Design and Research)
- Overall co-ordination (Stonebraker, DeWitt)
- Storage and execution (Madden, Cetintemel)
- Query layer and semantics (Zdonik, Maier)
- Provenance (Widom, Patel)
- Resource management (Balazinska)
- Language bindings (Carey)

32 SciDB Has a Good Chance at Success
- Community realizes shared infrastructure is good
- “Lighthouse” customers
- Strong team
- Computation goes inside the DBMS
  - Easier to share
  - And reuse

33 How Can You Help?
- Get involved!!!!

