Scientific Databases: the story behind the scenes. Martin Kersten, Milena Ivanova. M. Kersten, Mar 2010, DIR Edinburgh.


1 Scientific Databases: the story behind the scenes. Martin Kersten, Milena Ivanova. M. Kersten, Mar 2010, DIR Edinburgh

2 Departure for a journey. CWI Database Architecture Group. Core business: to research efficient and effective database technology; to deploy this technology in real-life application settings; to disseminate this knowledge as open-source software. Key research issue: what is the ultimate (virtual) machine architecture and software stack for database processing?

3 The Big Data Bang

4 Outline. Departure for a journey. Mapping unknown territory. Crossing the Great Divide. Stepping stone 1: Multimedia Dimension. Stepping stone 2: Geometric Dimension. Stepping stone 3: Lineage Dimension. Stepping stone 4: Heterogeneous Databases. Stepping stone 5: Semantic Search. Stepping stone 6: Wireless sensor databases. Stepping stone 7: Distributed Databases. Arrival and outlook. SciDB and SciLens ambitions. Teaming up and making it a success.

6 SkyServer provides public access to the SDSS for astronomers, students, and the wider public. SDSS: a project to make a map of a large part of the Universe. 230 million object images; 1 million spectra; 4 TB of catalog data; 9 TB of images.

7 SkyServer Schema. 446 columns, >370 million rows. Vertical fragment of 100+ popular columns. Materialized join of Photo and Spectra.

8 Initial exploration

9 Initial exploration

10 Initial exploration. e-Science concentrates on three main areas, viz., (A) facilitating the sharing of experimental devices, (B) developing tools for the management and sharing of scientific data, (C) developing analytical and simulation tools over scientific phenomena and …

11 Mapping unknown territory. [Diagram] Dimensions: Multimedia Images, Geometric Mapping, Features Space, Annotations, Modelling (Atlas). Science domains: Astronomy, Neuroscience, Geophysics, Biosciences, …

12 One size fits all? [Diagram] Scale: from pico scale to mega scale. Data: structured, semi-structured, documents, images. Systems: Oracle, MS SQL Server, DB2, Vertica, MonetDB, PostgreSQL, MySQL, MariaDB, SQLite, MongoDB, LucidDB, NoSQL.

13 We have to stand the storm

14 Stepping stone 1: Multimedia Dimension. Storage challenges: large volumes (>Tbyte, >Pbyte) of raw data; partitioning based on image and video segmentation; indexing based on feature vectors. Query challenges: proximity- and probability-based search; CPU-intensive, user-defined predicates; content-based information retrieval.

15 Stepping stone 1: Multimedia Dimension. The database consists of 100,000 images. From each image we extract 25 patches. For each patch a 14-dimensional feature vector is derived, giving 2,500,000 feature vectors in total. Challenge: find similar images based on Euclidean distance with sub-second response time. Solution: novel database algorithms for k-nearest neighbours (k-NN) search. Lesson: start from generative models.
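As a point of reference for the k-NN challenge above, here is a minimal brute-force sketch in Python. It is not the novel database algorithm the slide refers to; it only shows the query being answered. The random vectors, the sizes, and the names `euclidean` and `knn` are illustrative assumptions.

```python
import heapq
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(query, vectors, k):
    """Return the k closest (distance, index) pairs, nearest first.

    Brute force: every stored vector is compared against the query.
    """
    candidates = ((euclidean(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, candidates)


# Hypothetical data: 1,000 random 14-dimensional patch vectors
# (the slide's collection holds 2,500,000 of them).
random.seed(0)
DIM = 14
vectors = [[random.random() for _ in range(DIM)] for _ in range(1000)]

query = vectors[42]
neighbours = knn(query, vectors, k=5)
```

A linear scan like this costs one distance computation per stored vector, which is why sub-second answers over millions of vectors need indexing or smarter algorithms.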

16 Stepping stone 1: Multimedia Dimension. Alternative scheme: determine the probability that an image can be generated by a Gaussian mixture model (GMM) with a limited number of components. Fix a limited number of GMM components and use an Expectation Maximization algorithm to fit the model over the image. Search for similar images by comparing the GMM model parameters.
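The fitting step can be illustrated with a small Expectation Maximization loop. This is a simplified sketch under stated assumptions: it works on one-dimensional toy data rather than 14-dimensional image features, and the function name `fit_gmm_1d`, the initial means, and the data are all hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, initial_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300  # guard against underflow to zero
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weight, mean, and variance per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # floor avoids a zero variance
    return weights, means, variances


# Toy data with two obvious clusters, around 1.0 and 5.0.
data = [0.9, 1.0, 1.1, 1.0, 4.9, 5.0, 5.1, 5.0]
weights, means, variances = fit_gmm_1d(data, initial_means=[0.0, 6.0])
```

The fitted `(weights, means, variances)` triple is the compact model whose parameters would then be compared across images.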

17 Probabilistic Image Dimension. Query: Which of the models is most likely to generate these 24 samples?
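One way to read this query is as a maximum-likelihood choice among candidate mixture models. The sketch below assumes one-dimensional samples and three made-up models named A, B, and C; the functions `log_likelihood` and `most_likely_model` are hypothetical names, not part of any system mentioned in the talk.

```python
import math


def log_likelihood(samples, model):
    """Log-likelihood of samples under a 1-D Gaussian mixture.

    model is a list of (weight, mean, variance) components.
    """
    total = 0.0
    for x in samples:
        density = sum(
            w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in model
        )
        total += math.log(max(density, 1e-300))  # avoid log(0) on underflow
    return total


def most_likely_model(samples, models):
    """Name of the model with the highest log-likelihood for the samples."""
    return max(models, key=lambda name: log_likelihood(samples, models[name]))


# Hypothetical candidate models.
models = {
    "A": [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)],
    "B": [(1.0, 10.0, 1.0)],
    "C": [(1.0, 2.5, 0.25)],
}

# 24 illustrative samples, half near 0 and half near 5.
samples = ([-0.5, -0.25, 0.0, 0.25, 0.5, 0.0] * 2
           + [4.5, 4.75, 5.0, 5.25, 5.5, 5.0] * 2)

best = most_likely_model(samples, models)
```

Working in log space keeps the product of 24 small densities from underflowing.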

18 Probabilistic Image Dimension ?

19 Stepping stone 2: Geometric Dimension. Any geometric abstraction of reality provides a good navigational map. Database storage and indexing support for 2D is mature: R-trees and Quad-trees. Commercial database vendors do 'not like them'. An open research issue is support for 2D query embedding, and scaling out towards 3 and 4 dimensions and temporal support. Examples: researched extensively in Geographical Information Systems; Google Maps is omnipresent, as is openGIS. Lessons: avoid an abundance of reference models; baroque data structures do not necessarily scale.
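To make the 2D indexing point concrete, the following is a minimal point quad-tree sketch with a rectangular range query. It is an illustration only, not the structure used by any particular DBMS; the class name, the capacity of 4, and the grid data are assumptions, and the code expects distinct points.

```python
class QuadTree:
    """Point quad-tree over the half-open box [x0, x1) x [y0, y1)."""

    def __init__(self, x0, y0, x1, y1, capacity=4):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []      # points held while this node is a leaf
        self.children = None  # four sub-quadrants after a split

    def insert(self, x, y):
        """Insert a point; returns False if it lies outside this node."""
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((x, y))
                return True
            self._split()
        return self._insert_into_child(x, y)

    def _insert_into_child(self, x, y):
        # Exactly one child accepts the point; stop at the first success.
        return any(child.insert(x, y) for child in self.children)

    def _split(self):
        # Assumes distinct points; duplicates beyond capacity never separate.
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        c = self.capacity
        self.children = [
            QuadTree(x0, y0, mx, my, c), QuadTree(mx, y0, x1, my, c),
            QuadTree(x0, my, mx, y1, c), QuadTree(mx, my, x1, y1, c),
        ]
        for px, py in self.points:
            self._insert_into_child(px, py)
        self.points = []

    def query(self, qx0, qy0, qx1, qy1):
        """All points inside the closed rectangle [qx0, qx1] x [qy0, qy1]."""
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 >= x1 or qy1 < y0 or qy0 >= y1:
            return []  # rectangle does not overlap this node: prune
        found = [(x, y) for x, y in self.points
                 if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        if self.children is not None:
            for child in self.children:
                found.extend(child.query(qx0, qy0, qx1, qy1))
        return found


# Illustrative data: a 10 x 10 grid of distinct integer points.
tree = QuadTree(0, 0, 10, 10)
points = [(i, j) for i in range(10) for j in range(10)]
for x, y in points:
    tree.insert(x, y)

hits = tree.query(2, 2, 4, 4)
```

The pruning test in `query` is what lets a range search skip whole quadrants instead of scanning every point.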

20 Stepping stone 3: Lineage Dimension. The problem encountered in many scientific databases is to ensure data lineage: the ability to travel back in time to understand, redo, and judge the derivations. How to keep track of the complete context (data, software, parameter settings, …)? How to redo part of the analysis? How to store and remember the lineage trails? Example: the AstroWise project in Groningen keeps track of a complete workflow for telescope data analysis in a large Oracle database. All derivations are 5-line Python programs. Lesson: don't be afraid of storage cost, be an accountant.
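A lineage trail can be sketched as a store of derivation records, each naming its inputs, software, and parameters. This is a minimal illustration and not the AstroWise schema; the class `LineageStore`, the item names, and the software labels are all hypothetical.

```python
import hashlib
import json


class LineageStore:
    """Keeps one record per derived item: inputs, software, parameters."""

    def __init__(self):
        self.records = {}

    def record(self, inputs, software, params):
        """Register a derivation and return a content-based id for its result."""
        entry = {"inputs": sorted(inputs), "software": software, "params": params}
        digest = hashlib.sha256(json.dumps(entry, sort_keys=True).encode())
        key = digest.hexdigest()[:12]
        self.records[key] = entry
        return key

    def trail(self, item):
        """Walk back from item through every recorded derivation step."""
        steps = []
        frontier = [item]
        while frontier:
            current = frontier.pop()
            entry = self.records.get(current)
            if entry is None:
                continue  # raw input: nothing recorded before it
            steps.append((current, entry["software"]))
            frontier.extend(entry["inputs"])
        return steps


store = LineageStore()
# Hypothetical two-step workflow over made-up raw inputs.
calibrated = store.record(["raw_frame_001", "bias_017"], "calibrate v1.2", {"gain": 1.5})
catalog = store.record([calibrated], "extract v0.9", {"threshold": 3})
history = store.trail(catalog)
```

Because the id is a hash of the full context, redoing a derivation with identical inputs, software, and parameters yields the same id, while any change produces a new one.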

21 Stepping stone 4: Heterogeneous Databases. A key problem is to share heterogeneous information. Use a commonly approved vocabulary and standard syntax. XML is the language inter-galactica for self-descriptive data and its exchange between software systems. RDF claims to be the next king. The database community has been actively working on XML, XQuery, and XUpdate database engines, but it is not easy! Challenges: how to scale to large XML stores? How to efficiently search components? How to realize structural information retrieval? The RDF world brings in graph algorithms. Lessons: science is done, jewels are captured by bandits.

22 Database and Informatics Working Group, FBIRN 2005 (David Keator). [Diagram] fBIRN pipeline "big picture"; labels: MR scanner; scanner- or software-specific file formats; XML-based events file; XML-based image header; image pre-processing; event analysis.

23 Stepping stone 5: Semantic search. Ontology integration is one of the most pressing challenges for the semantic web to take off. Integration of the technology with databases is still immature. RDF and OWL are the leading paradigms; SPARQL is an attempt to bridge the gap between traditional database management and semantic web technology. Lesson: not a technological issue, but an educational and cultural issue. http://e-culture.multimedian.nl/demo/search

24 Stepping stone 6: Sensor Databases. Database management functionality can be downscaled to the level of small sensor-enabled devices. They can form ad hoc networks and provide a straightforward SQL interface for aggregation. The focus is on network-based aggregation under severe energy limitations. Embedded database systems are not up to the job. Positive case studies include TinyDB on TinyOS (Berkeley). The DataCell project at CWI (and Philips) aims to provide a more expressive query language and application interface.
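Network-based aggregation can be illustrated with partial aggregates merged along a routing tree. The sketch below only conveys the idea and does not use the TinyDB API; the topology, the readings, and the function names are assumptions.

```python
def merge(a, b):
    """Combine two partial aggregates, each a (sum, count) pair."""
    return (a[0] + b[0], a[1] + b[1])


def aggregate(node, children, readings):
    """Partial (sum, count) for the subtree rooted at node.

    Each node sends a single pair to its parent instead of forwarding
    every raw reading, which is what saves radio energy.
    """
    partial = (readings[node], 1)
    for child in children.get(node, []):
        partial = merge(partial, aggregate(child, children, readings))
    return partial


# Hypothetical routing tree: the base station has two children, 'a' has two.
children = {"base": ["a", "b"], "a": ["c", "d"]}
# Hypothetical temperature readings per node.
readings = {"base": 20.0, "a": 21.0, "b": 19.0, "c": 22.0, "d": 18.0}

total, count = aggregate("base", children, readings)
average = total / count  # answers a query such as SELECT AVG(temp) FROM sensors
```

AVG is shipped as a (sum, count) pair because averages of averages are not correct when subtrees differ in size.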

25 Research World Perspective. [Diagram] Labels: sensor cluster; mobile; stationary; distributed sensor net; mobile sensor cluster; integrated management; distributed management; PC-less sensor net; AmbientDB; Semantic Sensors. Time axis: Past to Future.

26 Stepping stone 7: MR/DDBMS. HPC … Grids … Clouds … Grids target high-performance computing, with a focus on Authentication-Authorization-Access and data shipping over wide-area networks. Map-reduce technology is a re-invention of re-scaled distributed database technology and distributed programming. Data distribution, replication, and parallel query processing have been well studied over the last three decades! Lesson: application programmers are infected by "not-written-by-me" hype bacteria.
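The claim that map-reduce mirrors distributed query processing can be illustrated by writing a grouped average as map, combine, and reduce steps over partitions. This is a toy sketch with hypothetical data and function names; it runs in one process and stands in for work that would be spread over several nodes.

```python
from collections import defaultdict


def map_partition(rows):
    """Map and local combine: per-key (sum, count) within one partition."""
    partial = defaultdict(lambda: (0.0, 0))
    for key, value in rows:
        s, c = partial[key]
        partial[key] = (s + value, c + 1)
    return dict(partial)


def reduce_partials(partials):
    """Reduce: merge per-partition partial aggregates into final averages."""
    merged = defaultdict(lambda: (0.0, 0))
    for partial in partials:
        for key, (s, c) in partial.items():
            ms, mc = merged[key]
            merged[key] = (ms + s, mc + c)
    return {key: s / c for key, (s, c) in merged.items()}


# Hypothetical (band, magnitude) rows, horizontally partitioned over two nodes.
partitions = [
    [("r", 17.0), ("g", 18.0)],
    [("r", 19.0), ("g", 20.0), ("r", 18.0)],
]

# Roughly: SELECT band, AVG(mag) FROM observations GROUP BY band
result = reduce_partials(map_partition(p) for p in partitions)
```

The same plan shape, partial aggregation per fragment followed by a merge, is what a parallel database would use for this GROUP BY.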

27 MonetDB in the large. MonetDB/Map-reduce: pure map-reduce approach driven by query streams, leading to a self-organising distributed database. MonetDB/Octopus: dynamic partial replication of databases, with an economic model for reallocation and recycler technology. MonetDB/Datacyclotron: let the database hotset flow like a stream of particles through a large and fast ring of connected machines, e.g. a data collider.

28 Get our hands dirty. Toys, Tools & Techniques.

29 The MonetDB product family. [Diagram] MonetDB kernel; query languages: SQL, XQuery; MAPI protocol; client interfaces: JDBC, ODBC, C-mapi lib, Perl, PHP, Python, RoR; end-user application.

30 The MonetDB Software Stack: an advanced column-oriented DBMS. [Diagram] Front ends: XQuery, SQL 03, GIS, SQL/XML, SOAP, Open-GIS; compile; Optimizers; MonetDB 4, MonetDB 5; MonetDB kernel.

31 The MonetDB Software Stack: an advanced column-oriented DBMS. [Diagram] SQL 03; Optimizers; MonetDB 5; MonetDB kernel. Extensions: orthogonal extension of SQL03; clear computational semantics; minimal extension to MonetDB.

32 An Architecture for Recycling Intermediates. M. Ivanova, M. L. Kersten, N. Nes, R. Goncalves. SIGMOD'09, Providence, RI, 30/06/2009. MonetDB Recycler Architecture. [Diagram] SQL, XQuery; MonetDB Server; Tactical Optimizer; Recycler Optimizer; MAL; MonetDB Kernel; Run-time Support; Recycle Pool; Admission & Eviction. Example MAL plan shown on the slide:
function user.s1_2(A0:date,...):void;
X5 := sql.bind("sys","lineitem",...);
X10 := algebra.select(X5,A0);
X12 := sql.bindIdx("sys","lineitem",...);
X15 := algebra.join(X10,X12);
X25 := mtime.addmonths(A1,A2);...
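The idea of recycling intermediates can be sketched as a pool of operator results keyed by operator name and arguments. This is a loose illustration, not the MonetDB recycler: admission here accepts every result and eviction is plain least-recently-used, both simplifying assumptions, and the column name, data, and `RecyclePool` class are hypothetical.

```python
from collections import OrderedDict


class RecyclePool:
    """Caches operator results so that repeated instructions are not re-run."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pool = OrderedDict()  # (operator, args) -> intermediate result
        self.hits = 0

    def run(self, op, args, compute):
        """Return the result of compute(*args), reusing a pooled copy if any."""
        key = (op, args)
        if key in self.pool:
            self.pool.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.pool[key]
        result = compute(*args)
        self.pool[key] = result            # admission: keep every result
        if len(self.pool) > self.capacity:
            self.pool.popitem(last=False)  # eviction: drop least recently used
        return result


# Hypothetical column data and a selection operator over it.
COLUMNS = {"lineitem.shipdate": ["1998-08-01", "1998-09-15", "1998-10-01"]}
calls = []


def select_before(column, bound):
    calls.append((column, bound))
    return [d for d in COLUMNS[column] if d <= bound]


pool = RecyclePool(capacity=2)
first = pool.run("select", ("lineitem.shipdate", "1998-09-30"), select_before)
second = pool.run("select", ("lineitem.shipdate", "1998-09-30"), select_before)
```

The second call is answered from the pool, so `select_before` executes only once for the two identical instructions.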

33 SciDB and SciLens projects. Goal: design and implement a database management system better geared to the requirements of scientific applications. SciDB vision (http://www.scidb.org): the array data model is missing; distributed, map-reduce processing from the start; no-cost loading of data; … redo all the hard work from the ground up. SciLens: multi-paradigm software layer; database summarisation is the key; … build on the shoulders of the MonetDB team.

34 Teaming up and making it a success. Crossing the Great Divide is challenging and rewarding iff building the bridge starts from both ends and the parties recognize and respect each other's core business. Open-source database technology provides a sound basis to manage sizeable scientific databases, and to capitalize on and steer expertise development. The database community can provide knowledge on modelling, query processing, algorithms, data structures, scalability, persistency, … and flexible database systems. The MonetDB team seeks new frontiers in scalable structured database management.

