“One Size Fits All” An Idea Whose Time Has Come and Gone by Michael Stonebraker.

Alternate title: The elephants are selling 30-year-old “bloatware”

Presentation transcript:

“One Size Fits All” An Idea Whose Time Has Come and Gone by Michael Stonebraker

Co-conspirators
- StreamBase benchmarking: John Lifter
- Vertica benchmarking: Chuck Bear
- ASAP design and benchmarking: Stavros Harizopoulos*, Jennie Rogers, Tingjian Ge
- 4-star wizard DBA: Nabil Hachem
- Kibitzers: Ugur Cetintemel, Stan Zdonik, Mitch Cherniack
* Looking for a job

Current DBMS Gold Standard
- Store fields in one record contiguously on disk
- Use B-tree indexing
- Use small (e.g. 4K) disk blocks
- Align fields on byte or word boundaries
- Conventional (row-oriented) query optimizer and executor

Terminology -- “Row Store”
[Figure: Record 1, Record 2, Record 3, Record 4]
E.g. DB2, Oracle, Sybase, SQLServer, …

Row Stores
- Can insert and delete a record in one physical write
- Good for business data processing (the IMS market of the 1970s)
- And that was what System R and Ingres were gunning for

Extensions to Row Stores Over the Years
- Architectural stuff (shared nothing, shared disk)
- Object relational stuff (user-defined types and functions)
- XML stuff
- Warehouse stuff (materialized views, bit map indexes)
- ….

Assertion
- There are at least 4 (non-trivial) markets where a row store can be clobbered by a specialized architecture
- “Clobbered” means 10X performance or more

In the Paper….
- Performance bakeoff numbers that validate the assertion for
  - Data warehouses
  - Stream processing
  - Scientific and intel data bases
- And a fluffy argument that the assertion is also true for text (Google, Yahoo, …)

Data Warehouses
- Two apples-to-apples benchmarks
  - Real customer telco app (Vertica vs an appliance)
  - Variant of TPC-H (Vertica vs an elephant)
- Using professionally tuned software
- On common hardware (in the elephant case)

Telco Call Detail Benchmark
- Vertica 47X a popular appliance on 1/7 the resources and 1/100 the hardware cost
- Why?
  - Queries read 6-7 of 212 columns -- column stores have a huge advantage
  - Compression: column stores compress better than row stores
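
The column-count argument on this slide can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that all 212 columns are the same width, and it ignores compression and indexing; the function name `fraction_of_data_read` is hypothetical and the result is not a figure from the benchmark.

```python
from fractions import Fraction

def fraction_of_data_read(columns_needed: int, total_columns: int) -> Fraction:
    """Fraction of table bytes a column store must scan, assuming equal-width
    columns (an illustrative assumption, not a figure from the benchmark)."""
    if not 0 < columns_needed <= total_columns:
        raise ValueError("columns_needed must be in 1..total_columns")
    return Fraction(columns_needed, total_columns)

# A row store reads whole records, so it scans all 212 columns' worth of bytes.
# A column store reads only the columns a query touches.
for needed in (6, 7):
    frac = fraction_of_data_read(needed, 212)
    print(f"{needed} of 212 columns: {float(frac):.1%} of the data, "
          f"about {float(1 / frac):.0f}x less I/O than a full-row scan")
```

Under these assumptions the I/O saving alone is of the same order as the reported speedup, before compression and cache effects are counted.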

Telco Call Detail Benchmark
- Why?
  - Indexing/ordering: appliance doesn’t do any
  - Vertica executor runs on compressed data
  - Less main memory data copying
  - Better L2 cache performance
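
To illustrate what "runs on compressed data" can mean, here is a minimal sketch using run-length encoding on a sorted column. The slides do not say which encodings Vertica uses, so RLE is an assumption chosen for simplicity, and the column contents and function names are hypothetical.

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

def count_equal(runs, target):
    """Count rows equal to target without decompressing the runs."""
    return sum(length for value, length in runs if value == target)

# Hypothetical, already-sorted 'state' column from a call detail table.
state_column = ["CA"] * 5 + ["MA"] * 3 + ["NY"] * 4
runs = rle_encode(state_column)
print(runs)
print(count_equal(runs, "MA"))
```

The predicate touches one pair per run rather than one value per row, which is why ordering and compression reinforce each other in this sketch.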

Skinny Fact Table (simplified TPC-H)
- Vertica 8X a very popular row store in ½ the space (same materialized views)
- Vertica 35X the same row store with equal space budget (actually 2/3)
- Both systems used partitioning, compression, and were tuned by wizards

Why 8X?
- Less data read
- Better compression
- Less main memory copying
- Better L2 cache performance

Stream Processing
- Virtual feed
  - Create a “first arriver” Wall Street composite feed
- Split adjusted price
  - From a Tick feed and a Split feed, produce “split adjusted price” feed
Both of these are real customer POCs (as opposed to Linear Road)

Stream Processing Results
- StreamBase 25X an elephant
  - If required state implemented as an RDBMS table
- StreamBase 7X an elephant
  - If required state implemented as local variables in a data base procedure (i.e. no use of the DBMS)

Why?
- Embedded application, not client-server
- Compile operations to machine code, not an intermediate form
- Optimized for pushing 1 record through a workflow, not joining 1M records to 1M records
- Operations don’t queue results; they directly call the next operator
- Time windows as basic primitive
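
Two of these points, operators that call the next operator directly and time windows as a primitive, can be sketched in a few lines. This is an illustration only, not StreamBase's implementation: the class names are hypothetical, and interpreted Python obviously does not show the compile-to-machine-code point.

```python
from collections import deque

class Filter:
    """Passes a record straight to the next operator; nothing is queued."""
    def __init__(self, predicate, downstream):
        self.predicate = predicate
        self.downstream = downstream

    def push(self, record):
        if self.predicate(record):
            self.downstream.push(record)  # direct call, no intermediate queue

class TimeWindowAvg:
    """Sliding time window: average price over the last `width` seconds."""
    def __init__(self, width, downstream):
        self.width = width
        self.downstream = downstream
        self.window = deque()  # (time, price) pairs currently in the window

    def push(self, record):
        now = record["time"]
        self.window.append((now, record["price"]))
        # Evict tuples that have fallen out of the window.
        while self.window and self.window[0][0] <= now - self.width:
            self.window.popleft()
        avg = sum(p for _, p in self.window) / len(self.window)
        self.downstream.push({"time": now, "avg_price": avg})

class Collect:
    """Terminal operator that stores results for inspection."""
    def __init__(self):
        self.results = []

    def push(self, record):
        self.results.append(record)

sink = Collect()
pipeline = Filter(lambda r: r["symbol"] == "IBM", TimeWindowAvg(10, sink))

ticks = [
    {"symbol": "IBM", "price": 100.0, "time": 0},
    {"symbol": "HPQ", "price": 40.0, "time": 1},
    {"symbol": "IBM", "price": 102.0, "time": 5},
    {"symbol": "IBM", "price": 104.0, "time": 12},
]
for tick in ticks:
    pipeline.push(tick)  # one record at a time flows through the workflow
print(sink.results)
```

Each record travels through the whole workflow in a single chain of function calls, and the window state lives inside the operator rather than in a table.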

A Note in Passing
- Some stream engines are implemented on top of DBMS technology
  - i.e. filters, join performed by the embedded DBMS
  - i.e. time windows implemented as DBMS tables
- Costs more than one order of magnitude in performance
- Lose elephant advantage!

Another Note in Passing….
StreamSQL is the obvious paradigm to mix real time processing with lookup of state information

    Select T.symbol, price = T.price * S.factor, T.volume, T.time
    From Ticks T, Storage S
    Where S.symbol = T.symbol
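
For readers who do not use StreamSQL, the query on this slide can be approximated in plain Python. This is a sketch under assumptions: `storage` is modeled as a static dictionary from symbol to split factor, whereas in a real deployment it would presumably be kept current by the Split feed, and all sample values are made up.

```python
def split_adjusted(ticks, storage):
    """Yield each tick with its price multiplied by the stored split factor.

    Ticks with no matching symbol in storage are dropped, as in an inner join.
    """
    for tick in ticks:
        factor = storage.get(tick["symbol"])
        if factor is not None:
            yield {
                "symbol": tick["symbol"],
                "price": tick["price"] * factor,
                "volume": tick["volume"],
                "time": tick["time"],
            }

# Hypothetical stored state: symbol -> cumulative split factor.
storage = {"IBM": 0.5, "HPQ": 1.0}
ticks = [
    {"symbol": "IBM", "price": 200.0, "volume": 100, "time": 1},
    {"symbol": "XYZ", "price": 10.0, "volume": 50, "time": 2},
    {"symbol": "HPQ", "price": 40.0, "volume": 300, "time": 3},
]
adjusted = list(split_adjusted(ticks, storage))
for row in adjusted:
    print(row)
```

The generator processes one tick at a time and looks up state as it goes, which is the mix of real-time processing and state lookup the slide describes.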

Third Area: Scientific and Intel Apps
- Artificial (simple) benchmark
- Comparing
  - ASAP (new Brown/Brandeis/MIT prototype)
  - Matlab
  - An elephant
- On some simple array calculations
- But arrays are big

Scientific and Intel Results
- ASAP > 100X the elephant
- ASAP ~ 10X Matlab (high variance)

Why?
- Chunky Store
  - Fundamental storage unit is an “array chunk” (reminiscent of Sarawagi’s work)
  - Regular and irregular indexes
  - Sparse and dense arrays

Why?
- Compression
  - Regular indexes not stored
  - Delta compression in any direction (reminiscent of MPEG)
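
The two compression ideas on this slide can be sketched briefly. The slides do not describe ASAP's actual storage format, so the code below is only an assumed, one-dimensional illustration: delta encoding along a single axis, and a regular index computed from a hypothetical origin and step instead of being stored.

```python
def delta_encode(values):
    """Store the first value, then successive differences along one axis."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def regular_coordinate(origin, step, i):
    """A regular index is computed from (origin, step), so it need not be stored."""
    return origin + step * i

# Hypothetical sensor readings along one dimension of an array chunk.
readings = [1000, 1002, 1003, 1003, 1007]
deltas = delta_encode(readings)
print(deltas)
print(delta_decode(deltas) == readings)
print([regular_coordinate(0.0, 0.5, i) for i in range(len(readings))])
```

For a multidimensional chunk, "in any direction" would mean applying the same differencing along whichever axis varies most smoothly.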

Why?
- Standard array operations as primitives, plus:
  - regrid
  - locate
  - pivot
- Not simulated on top of relational primitives

Other stuff
- Seamless integration of real time and stored state (Intel guys go ga-ga)
- StreamSQL for arrays!
- Lineage (simpler, more efficient model than Trio)
- Uncertainty (different than Trio)

ASAP
- Real-time stuff adapted from Aurora/Borealis
- Demo-able
- New storage system from scratch
- Enough works to get some numbers

Demo
- Two video cameras: IR and conventional
- Forward the better image on a frame-by-frame basis as lighting changes

Query Network

Text
- Search guys don’t use DBMSs
  - Too slow
  - No need for XACTS
  - Run only one query
  - No need for 100% precision
  - ….

So What is an RDBMS Elephant to do?
- Yawn
  - Always been high end specialization for a few crazy lunatics
- K engines united by a common parser
  - StreamSQL is a step in this direction

So What is an RDBMS Elephant to do?
- Data federations of incompatible systems
  - Full employment act for CS folks forever
- A new (much more general) storage engine
  - E.g. morph between rows, columns and chunks

Obvious Research Agenda
- Find a market where OSFA doesn’t work and customers are in pain
- Figure out what does

More General Issue
- Fast stream processing engines don’t use the standard system software stack (web servers, app servers, DBMS)
- How many other refactorings of system software capabilities are there?

The Curse
- May you live in interesting times