The Researcher’s Guide to the Data Deluge: Querying a Scientific Database in just a Few Seconds
Martin L. Kersten, Stratos Idreos, Stefan Manegold, Erietta Liarou (and members of the CWI database group)
Science Feb’11 Data http://www.sciencemag.org/site/special/data/
Science Feb’11 Data …. We have recently passed the point where more data is being collected than we can physically store. This storage gap will widen rapidly in data-intensive fields. Thus, decisions will be needed on which data to archive and which to discard. A separate problem is how to access and use these data. Many data sets are becoming too large to download. Even fields with well-established data archives, such as genomics, are facing new and growing challenges in data volume and management. And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used….
Database research vision
Throwing away data before harvesting is the worst ROI one can imagine.
LSST budget is 100 M$
– During its ten-year survey, LSST will acquire 5.6 million 15-second images, spread over 2.8 million pointings.
– 20 billion rows in the Object table, 3 trillion rows in the Source table
Database technology is not designed for the challenges All sizes don’t fit
The Dawn of a new Database Era Capture the query intent !
FIVE STEPS INTO THE FUTURE
– One-minute DBMS for real-time performance.
– Multi-scale query processing for gradual exploration.
– Post processing for conveying meaningful data.
– Query morphing to adjust for proximity results.
– Query alternatives to cope with lack of providence.
One-minute database kernels
Step 1: Do the BEST you can within a given time frame!
Research how to …
– organize query evaluation around what is available at low cost
– redesign algorithms and operators such that they adaptively avoid expensive steps normally needed for correctness and completeness
– stop the process after an agreed-upon time
– ensure continuation upon request.
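The last two points of Step 1 (stop after an agreed-upon time, continue upon request) can be illustrated with a small sketch. This is not the authors' kernel design: `BudgetedScan`, its `run` method, and the injectable `clock` parameter are hypothetical names chosen for illustration, and a plain filtered scan stands in for real query evaluation.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


class BudgetedScan:
    """Hypothetical helper: filter rows until a time budget runs out.

    The scan keeps its position between calls, so a later call to run()
    continues where the previous one stopped.
    """

    def __init__(self, rows: Iterable[T], predicate: Callable[[T], bool]) -> None:
        self._it: Iterator[T] = iter(rows)
        self._predicate = predicate
        self.exhausted = False  # True once every row has been seen

    def run(
        self,
        budget_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> List[T]:
        """Return the matches found before the budget expired."""
        deadline = clock() + budget_seconds
        matches: List[T] = []
        while clock() < deadline:
            try:
                row = next(self._it)
            except StopIteration:
                self.exhausted = True
                break
            if self._predicate(row):
                matches.append(row)
        return matches


if __name__ == "__main__":
    scan = BudgetedScan(range(5_000_000), lambda x: x % 1000 == 0)
    partial = scan.run(0.05)   # best effort within 50 ms
    more = scan.run(0.05)      # continuation upon request
    print(len(partial), len(more), scan.exhausted)
```

The answer returned by one call is incomplete by design; the `exhausted` flag tells the caller whether asking for a continuation can still produce more rows.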
Multi-scale query processing
Step 2: Use a staging scheme for query evaluation!
Research how to …
– partition the database for producing incremental valuable results
  D => D1 union (D2.1 union (D2.2 union (D2.3 union..
– avoid harmful SELECT * FROM table queries
– break a query into a converging query sequence
  Q => Q1 union Q2
    => Q1 union Q2.1 union Q2.2
    => Q1 union Q2.1 union Q2.2.1 union Q2.2.2 …….
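A minimal sketch of the staging idea, assuming the table fits in an in-memory sequence and that stages simply double in size. The function names `stage_bounds` and `multiscale_select` and the doubling policy are assumptions made for illustration; the slides do not prescribe how the partitions D1, D2.1, D2.2, … are chosen.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def stage_bounds(n: int, first: int, growth: int = 2) -> List[range]:
    """Split row positions 0..n-1 into consecutive stages of growing size."""
    stages: List[range] = []
    start, size = 0, max(1, first)
    while start < n:
        end = min(n, start + size)
        stages.append(range(start, end))
        start, size = end, max(1, size * growth)
    return stages


def multiscale_select(
    rows: Sequence[T],
    predicate: Callable[[T], bool],
    first: int = 2,
    growth: int = 2,
) -> Iterator[List[T]]:
    """Yield the cumulative answer after each stage has been evaluated.

    The answer after the last stage equals the answer of the full query,
    so the sequence converges: Q1, Q1 union Q2.1, Q1 union Q2.1 union Q2.2, ...
    """
    partial: List[T] = []
    for stage in stage_bounds(len(rows), first, growth):
        partial.extend(rows[i] for i in stage if predicate(rows[i]))
        yield list(partial)


if __name__ == "__main__":
    for answer in multiscale_select(list(range(10)), lambda x: x % 3 == 0):
        print(answer)
```

A user who is satisfied with an early, small answer can stop consuming the generator, and the later and larger stages are never evaluated.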
Result-set post processing
Step 3: Use meaningful compression to convey more!
Research how to …
– post-process result sets statistically
– prepare for faceted query answers
– show sort for boundaries first
  Min/max domain enclosures for all attributes
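The "min/max domain enclosures" mentioned in Step 3 can be read as a per-attribute summary of a result set. The sketch below assumes rows arrive as dictionaries of numeric attributes; `domain_enclosures` is a hypothetical helper name, and the sample values are made up for the demonstration.

```python
from typing import Dict, Iterable, Mapping, Tuple


def domain_enclosures(
    rows: Iterable[Mapping[str, float]],
) -> Dict[str, Tuple[float, float]]:
    """Return the (min, max) enclosure of every attribute in one pass."""
    bounds: Dict[str, Tuple[float, float]] = {}
    for row in rows:
        for attr, value in row.items():
            lo, hi = bounds.get(attr, (value, value))
            bounds[attr] = (min(lo, value), max(hi, value))
    return bounds


if __name__ == "__main__":
    # Illustrative values only, not taken from a real catalogue.
    result_set = [
        {"ra": 53.2, "dec": 80.5},
        {"ra": 53.9, "dec": 81.7},
        {"ra": 53.5, "dec": 80.1},
    ]
    print(domain_enclosures(result_set))
```

Shipping these few numbers before the rows themselves gives the user the boundaries of the answer at almost no cost.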
Query morphing
Step 4: Bend the search towards interesting areas!
Research how to …
– explore the query expression space?
– transform a query with small result set such that it produces relevant, nearby answers
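One simple form of morphing, shown here as a sketch rather than the method the authors have in mind, is to widen a range predicate step by step until the result set is large enough. The name `morph_range` and the parameters `min_results`, `step`, and `max_rounds` are hypothetical.

```python
from typing import List, Sequence, Tuple


def morph_range(
    values: Sequence[float],
    low: float,
    high: float,
    min_results: int,
    step: float,
    max_rounds: int = 10,
) -> Tuple[Tuple[float, float], List[float]]:
    """Widen [low, high] symmetrically until enough values fall inside.

    Returns the morphed bounds together with the matching values. The
    search gives up after max_rounds widenings and returns what it has.
    """
    hits = [v for v in values if low <= v <= high]
    rounds = 0
    while len(hits) < min_results and rounds < max_rounds:
        low, high = low - step, high + step
        hits = [v for v in values if low <= v <= high]
        rounds += 1
    return (low, high), hits


if __name__ == "__main__":
    ra_values = [50, 52, 53, 54, 56, 60]
    bounds, hits = morph_range(ra_values, 53, 54, min_results=4, step=1)
    print(bounds, hits)
```

The returned bounds tell the user how far the original query had to be bent, which is itself useful feedback about where the nearby data lies.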
Query alternatives
Step 5: Ignore stupid questions, give hints instead!
Research how to …
– find alternative queries in terms of expressiveness + performance
– better exploit the query log for hints

-- Q1: Using the time budget. (36291322 tuples)
SELECT ra, dec, band1, intensity1, type FROM PhotoObj;
-- Q2: Using data statistics. (879300 tuples)
SELECT * FROM PhotoObj WHERE ra BETWEEN 53 AND 54 AND dec BETWEEN 80 AND 82;
-- Q3: Using query statistics. (899 tuples)
SELECT * FROM PhotoObj WHERE ra BETWEEN 53 AND 54 AND dec BETWEEN 80 AND 82 AND distance(ra,dec,radius) < 10;

SELECT * FROM PhotoObj
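As a rough illustration of using a query log for hints, the sketch below picks logged queries on the same table whose known result size fits a tuple limit. The log layout (`LogEntry`), the function `suggest_alternatives`, and the ranking rule (largest fitting result first) are all assumptions; only the three queries and their tuple counts come from the Q1 to Q3 comments above.

```python
from typing import List, NamedTuple, Sequence


class LogEntry(NamedTuple):
    """Hypothetical query-log record: table, query text, observed result size."""
    table: str
    sql: str
    tuples: int


def suggest_alternatives(
    log: Sequence[LogEntry],
    table: str,
    max_tuples: int,
    limit: int = 3,
) -> List[str]:
    """Suggest logged queries on `table` returning at most `max_tuples` rows.

    Larger fitting results are listed first, on the assumption that the
    user wants as much data as the limit allows.
    """
    fitting = [e for e in log if e.table == table and e.tuples <= max_tuples]
    fitting.sort(key=lambda e: e.tuples, reverse=True)
    return [e.sql for e in fitting[:limit]]


Q1 = "SELECT ra, dec, band1, intensity1, type FROM PhotoObj;"
Q2 = ("SELECT * FROM PhotoObj WHERE ra BETWEEN 53 AND 54 "
      "AND dec BETWEEN 80 AND 82;")
Q3 = ("SELECT * FROM PhotoObj WHERE ra BETWEEN 53 AND 54 "
      "AND dec BETWEEN 80 AND 82 AND distance(ra,dec,radius) < 10;")

QUERY_LOG = [
    LogEntry("PhotoObj", Q1, 36291322),
    LogEntry("PhotoObj", Q2, 879300),
    LogEntry("PhotoObj", Q3, 899),
]

if __name__ == "__main__":
    # Hints offered instead of running SELECT * FROM PhotoObj as asked.
    for hint in suggest_alternatives(QUERY_LOG, "PhotoObj", max_tuples=1_000_000):
        print(hint)
```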
The Researcher’s Guide to the Data Deluge: Querying a Scientific Database in just a Few Seconds
It is not a new hype,
It is not about delta research,
It is not a revolution misunderstood by the masses,
BUT
It is our conscious technological evolution,
It is our harvesting past experiences,
It is our full awareness of the real challenges