Presentation on theme: "Notes on Geographic Information Systems, DBMS Technology, and SciDB"— Presentation transcript:
1 Notes on Geographic Information Systems, DBMS Technology, and SciDB
A Tale of Dirt-Bags (Earth Scientists) and Propeller-Heads (Computer Scientists)
Dr. Paul G. Brown
Paradigm4 / SciDB
2 Overview of Talk
How did we get here?
  DBMS, GIS, Scientific and Statistical Data Management
Pioneers and Pilgrims
  Dark Ages: Pre-Internet, Pre-Web, Sneaker-net and boxes full of CDs
  Sequoia 2000
  Jim Gray and the Sloan Digital Sky Survey
  XLDB Conferences
Science and Its Methods
  Why your skill sets will become lucrative (not just important).
Quick Overview of SciDB
We are witnessing the rise of the Scientific Data Management System: a category of applications that draws on the lessons of traditional IT but focuses on the requirements and methods of scientific data management and analysis.
3 In the (very) beginning …
In the (very) beginning was the application …
  Small set of files with (semi-)standardized internal format.
  Large and complex libraries for accessing file content.
  Simple(?) scripting languages for glue.
  Examples: IMS + COBOL + JCL, NetCDF/HDF/FITS + C/C++ + Perl/Python, etc.
Commercial Data Management: Rapid adoption of RDBMS / SQL. Why?
  Ad hoc (for each task) data model requirement (no industry standards).
  More demanding quality of service guarantees (transactions, access control).
  Enormous pent-up demand for data sharing and collaboration.
  Commercial data management was process oriented.
Scientific Data Management: Went a different route. Why?
  Data consumers and producers in different communities. (Sneaker-net).
  Science organized into project teams: goal oriented.
  Technical innovation (algorithm development) as important as scientific progress.
4 Sequoia 2000
5-Year Investigation into Scientific Data Management – University of California System, DARPA, Digital Equipment Corp.
Collaboration between Computer Science types (Mike Stonebraker at UC Berkeley, Jim Gray at DEC) and users of Geographic Data (Dozier/Frew at UC Santa Barbara, UCLA and UC San Diego Climate Modelers)
  EOS-DIS Alternative Architecture Study
  First wide area network (connecting UC Campuses) at “T3” bandwidth (100MBps)
  Postgres 4.3 – R-Trees, spatial types etc. Eventually, PostgreSQL and PostGIS
The Propeller-heads and the Dirt-bags
  “Ignorance raised to the power of arrogance.” – James Frew
  Computer Scientists – “What do you mean your data’s square?”
  Earth Scientists – “What do you mean more than one person can read and write the same data at the same time?”
5 (Hard-won) Lessons
Collaboration isn’t easy …
  Different teams spoke different languages …
  … even within related scientific disciplines.
Dirt-bags had more to gain than Propeller-heads
  Technology that enables collaboration.
  Ask questions (queries), don’t write programs.
Propeller-heads had more to learn than Dirt-bags
  SQL might be a $10 billion market, but it doesn’t do:
  Image processing, numerical analysis, time-series, HDF, etc.
Strong Claim # 1: Inter-disciplinary innovation is necessary for us to make significant progress.
6 XLDB Conferences – 2008
Bring together Propeller-heads and Dirt-bags
2008 Thought: How to do next-gen Science?
  Large Hadron Collider
  Large Synoptic Survey Telescope
  Initial survey of science requirements that informed the design and implementation of SciDB.
2009 Thought: What about Industrial Big Data Users?
  Turned out, industrial data sizes will be 10x scientific!
  “Internet of things”
Now 3 Annual Conferences – US, Europe, Asia
7 Big Science – Research Systems
How can we use DBMS technology to help Scientific data management?
Sloan Digital Sky Survey (http://www.sdss.org/)
  A database of astronomical objects.
  Query-centric interfaces, web-facing APIs.
TerraServer (http://www.terraserver.com/)
  Point the “big eye” down
  Commercial application of remote sensing data.
NIH – 1,000 Genomes Online
  Powered by SciDB since 2010
  8T of data online, 3,000 analytic sessions per day.
  Growing as fast as they can …
8 Where are the Propeller-Heads?
Playing Football like Seven-Year-Olds
  Documents! Hadoop! Triple-Stores! Graph-DBs!
  Take a technology with proven value in a specific use-case …
  … declare it to be the Next Big Thing (it will crush SQL/RDBMS!) …
  … chase each new idea like seven-year-olds chase a ball.
Roll the Clock Forward to 2014
  Hadoop Providers are (Re-)Implementing SQL
  HIVE, Cloudera’s Impala, Hortonworks Stinger, YARN + Spark
Strong Claim # 2: One size (one technical architecture) does NOT fit all problem domains.
9 Jim Gray and the Fourth Paradigm
Who was Jim?
  Turing Award Winner
  Architect of $erious $ystems
  (Ultimate Propeller-head)
What is the “Fourth Paradigm”?
  eScience
  “Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.”
10 The “Big Idea” Slide
The methodologies used to analyze scientific data are central to how we understand our world.
Ubiquitous networks of sensors will render much of the world an empirical or scientific phenomenon.
  How to store all that data?
  How do we share that data?
  How will we reason about it?
  How can such development be made to work economically?
“Increasingly, life’s everyday decisions [originally: scientific breakthroughs] will be powered by advanced computing capabilities that help everyone [originally: researchers] manipulate and explore massive datasets.”
11 Challenges
Collaboration
  Overcoming the “language challenges” inherent when attempting any inter-disciplinary project.
  Sensitivity to legal and ethical issues: privacy.
Information Integration
  Technical standards for data communication.
  Data cleanliness, and identifying common information.
Visualization and Simple (but not too simple!) Interfaces
  Nothing to add!
Ubiquitous Availability
“We have to do better at producing tools to support the whole research cycle – from data capture and data curation to data analysis and data visualization.” (Gray’s Turing Lecture)
12 Ergo, SciDB
How do you find out what Dirt-bags want?
  You ask them!?
Arrays (or Matrices) as the basic structural building block
Algebra of array manipulation operations as API
Distributed computation (cloud or cluster) for scale
Integrated processing and storage platform
Extensible framework (to allow for algorithm innovation)
Provenance (track data through its life-cycle) and no-overwrite storage
Client languages of choice: ‘R’, Python, not 4GLs or C/C++
In-situ data access (as well as providing a data store)
M. Stonebraker, J. Becla, D.J. DeWitt, K. Lim, D. Maier, O. Ratzesberger, and S.B. Zdonik, "Requirements for Science Data Bases and SciDB", in Proc. CIDR, 2009
or if you are reading this in the 21st Century.
13 Why SciDB?
Big analytics without big hassles
  R, Python, Matlab, Julia, …
  MPP storage and compute
  Array data model
  Complex analytics
  Commodity clusters or cloud
14 SciDB – The (Very) Short Tour – 1.
Array Data Model
“Data Management for Squares”

    CREATE ARRAY Example
    < data : float >
    [ X=0:*,1000,0, J=0:*,1000,0 ];

    CREATE ARRAY geodata
    < track-index : int16, scanindex : int16,
      height : int16, sensorzenith : float, sensorazimuth : float,
      range : uint32, solarzenith : float, solarazimuth : float,
      landseamask : uint8 >
    [ longitude = : , 50000, 0,
      latitude = : , 50000, 0,
      start_time = : , 1, 0,
      platformid = 0 : 1, 1, 0,
      resolutionid = 0 : 2, 1, 0 ];

Example Array from: Planthaber, Gary Lee, Jr. “MODBASE: a SciDB-powered system for large-scale distributed storage and analysis of MODIS earth remote sensing data”, MIT, 2012
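As a rough illustration of what a dimension specification such as `X=0:*,1000,0` implies, the sketch below maps cell coordinates to the chunk that holds them, assuming the three trailing fields mean start 0, chunk length 1000, and overlap 0. `chunk_origin` and `ChunkedArray` are hypothetical names invented for this example and do not describe how SciDB stores chunks internally.

```python
from collections import defaultdict

CHUNK = 1000  # assumed chunk length per dimension, as in X=0:*,1000,0


def chunk_origin(coords, chunk_len=CHUNK, start=0):
    """Return the lowest-corner coordinates of the chunk containing `coords`."""
    return tuple(start + ((c - start) // chunk_len) * chunk_len for c in coords)


class ChunkedArray:
    """Toy sparse 2-D array that groups cells by chunk (illustration only)."""

    def __init__(self, chunk_len=CHUNK):
        self.chunk_len = chunk_len
        self.chunks = defaultdict(dict)  # chunk origin -> {(x, j): value}

    def put(self, x, j, value):
        self.chunks[chunk_origin((x, j), self.chunk_len)][(x, j)] = value

    def get(self, x, j):
        # Use .get on the outer mapping so a miss does not create an empty chunk.
        chunk = self.chunks.get(chunk_origin((x, j), self.chunk_len), {})
        return chunk.get((x, j))


arr = ChunkedArray()
arr.put(5, 7, 1.5)
arr.put(1500, 999, 2.5)
```

Cells that are close together in coordinate space land in the same chunk, which is the property a distributed array store can exploit when it spreads chunks across nodes.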
15 SciDB – The (Very) Short Tour – 2. Query Languages
High Level API – What, not How

    SELECT SUM ( data ) AS Sum_Data
    FROM between ( Example, 500, 500, 1500, 1500 );

    SELECT MEDIAN ( height ) AS Median_Height,
           AVG ( height ) AS Avg_Height
    FROM slice ( geodata, platformid, 3 )
    WHERE sensorzenith < 35.0
    REGRID AS ( PARTITION BY longitude 1000,
                             latitude 1000,
                             start_time geo_range ( '10 days' ) );

AQL looks a bit like SQL, but the underlying algebra is arrays, not sets.
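The two queries above combine a box selection (`between`) with an aggregate, and a coarsening step (`REGRID`) with per-block aggregates. A minimal sketch of both ideas follows, assuming the box bounds are inclusive and that regridding groups cells into fixed-size blocks along each dimension. The `between` and `regrid` functions are plain-Python stand-ins written for this illustration, not SciDB's implementation, and the sample cells are made up.

```python
from collections import defaultdict
from statistics import mean


def between(cells, lo, hi):
    """Keep cells whose coordinates fall inside the inclusive box [lo, hi]."""
    return {
        coords: value
        for coords, value in cells.items()
        if all(l <= c <= h for c, l, h in zip(coords, lo, hi))
    }


def regrid(cells, block, agg):
    """Group cells into blocks of size `block` per dimension and aggregate each group."""
    groups = defaultdict(list)
    for coords, value in cells.items():
        key = tuple(c // b for c, b in zip(coords, block))
        groups[key].append(value)
    return {key: agg(values) for key, values in groups.items()}


# Made-up sparse 2-D array: {(x, j): data}
example = {(499, 500): 1.0, (500, 500): 2.0, (1500, 1500): 3.0, (1501, 10): 4.0}

# Analogue of: SELECT SUM(data) FROM between(Example, 500, 500, 1500, 1500)
total = sum(between(example, (500, 500), (1500, 1500)).values())

# Analogue of a regrid into 1000 x 1000 blocks with an average per block
coarse = regrid(example, (1000, 1000), mean)
```

Both operations take an array and return an array, which is what lets them be nested inside one another in a single query.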
16 SciDB – The (Very) Short Tour – 3.
AFL
  Functional, array level manipulation
  Familiar to ‘R’ and Python Users

    project (apply (join (filter (Masks, name = 'California'),
        geodata
        height_color,
        calc_height_color(height)));

Composable Query Languages allow you to build sophisticated programs by combining simple building blocks.

    SELECT MEDIAN ( height ) AS Median_Height,
           AVG ( height ) AS Avg_Height
    FROM slice ( geodata, platformid, 3 )
    WHERE sensorzenith < 35.0
    REGRID AS ( PARTITION BY longitude 1000,
                             latitude 1000,
                             start_time geo_range ( '10 days' ) );
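The composability point can be sketched in Python: each operator takes a collection of cells and returns a new one, so operators nest the way the AFL expression does. The functions `filter_`, `apply_`, and `project` below are hypothetical stand-ins named after the AFL operators, `height_color` is an invented substitute for `calc_height_color`, and the sample cells are made up.

```python
def filter_(cells, predicate):
    """Keep only the cells for which the predicate holds."""
    return [cell for cell in cells if predicate(cell)]


def apply_(cells, name, fn):
    """Add a computed attribute `name` to every cell."""
    return [{**cell, name: fn(cell)} for cell in cells]


def project(cells, *names):
    """Keep only the named attributes of every cell."""
    return [{n: cell[n] for n in names} for cell in cells]


def height_color(cell):
    # Invented rule for illustration: high terrain is white, the rest green.
    return "white" if cell["height"] > 2000 else "green"


geodata = [
    {"longitude": 1, "latitude": 1, "height": 2500, "sensorzenith": 10.0},
    {"longitude": 2, "latitude": 1, "height": 300, "sensorzenith": 40.0},
    {"longitude": 3, "latitude": 1, "height": 120, "sensorzenith": 20.0},
]

# Operators nest inside one another, innermost first.
result = project(
    apply_(
        filter_(geodata, lambda c: c["sensorzenith"] < 35.0),
        "height_color",
        height_color,
    ),
    "height",
    "height_color",
)
```

Because every step has the same input and output shape, a larger program is just a deeper nesting of the same small operators.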
18 Scientific “Big Data”, “Big Analytics”
Dark Matter Detector – LUX
  1 TB per day
  100 collaborators (research grants)
  Find “interesting” particle collisions in a vessel holding 370 kg of liquid xenon, where interesting events are very rare.
Metabolic Atlas – Mass Spectrometry DB
  Genomics + Phenotype + Proteomics
  “What is alive in this drop of sea-water?”
Next Generation Genome Sequencing
  Cost of sequencing a human genome is collapsing.
  $1M, $10K, $1K
  Data per sequencing process is growing.
  Ion Torrent Sequencing – 80 B reads of 400 bp / $1 per M bp in 2 hours
  Gene sequencing will be a routine part of medicine by the end of the decade.
19 Surprises Along the Way
There is commercial demand for SciDB!
  Image processing applications in Radiology, Bio-IT.
  Remote sensing applications interesting to various Govt. agencies and some commercial entities (agriculture, logistics).
  Geo-located sensors in vehicles; driver behavior for insurance.
Arrays for more than just images
  Genome database: 2D array [ sample x base_pair ]
  Timeseries data: [ anything x time ]
  Graph Analytics: [ calling_phone_# x called_phone_# ]
Traction on the “Scientific Warehouse”
  Cost savings by centralizing infrastructure
  Productivity advantages from cross-team collaboration
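To make the graph-as-array idea concrete, here is a small sketch that stores call counts as a sparse 2-D array indexed by [ caller x callee ] and answers two graph questions with array operations: calls placed per caller (row sums) and two-hop reachability (a sparse matrix product). The phone labels and counts are made up, and `row_sums` and `matmul` are hypothetical helpers written for this example.

```python
from collections import defaultdict

# Made-up call log as a sparse array: (caller, callee) -> number of calls
calls = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}


def row_sums(matrix):
    """Total calls placed by each caller (sum along the callee dimension)."""
    totals = defaultdict(int)
    for (row, _col), value in matrix.items():
        totals[row] += value
    return dict(totals)


def matmul(a, b):
    """Sparse matrix product; entry (i, j) counts weighted two-hop paths i -> k -> j."""
    b_by_row = defaultdict(list)
    for (k, j), value in b.items():
        b_by_row[k].append((j, value))
    product = defaultdict(int)
    for (i, k), value in a.items():
        for j, weight in b_by_row.get(k, ()):
            product[(i, j)] += value * weight
    return dict(product)


placed = row_sums(calls)
two_hop = matmul(calls, calls)
```

The same two operations apply unchanged to any array with the [ row x column ] shape, which is why one array engine can serve images, genomes, time series, and graphs.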
20 Strong Claim # 3: Tools and methodologies that have traditionally been restricted to “scientific” research will become central to “commercial” and “industrial” data processing.
21 Conclusions
Dirt-bags told the Propeller-heads what they wanted …
  Scalable, flexible array storage and data processing.
  Platform for collaborative analytics on machine-data.
Propeller-heads responded …
  Not just SciDB!
  MonetDB, Rasdaman, InfluxDB – all array DBMSs
  Struggling to fit scientific data processing into other paradigms – SQL + HDFS.
And not a moment too soon!
  Shift of management approach in Big Science towards shared infrastructure (cost saving, productivity).
  Multiple “commercial” consumers who need to use scientific tools and methods in their analysis.