MonetDB/X100 hyper-pipelining query execution Peter Boncz, Marcin Zukowski, Niels Nes

Contents
- Introduction & Motivation
- Research: DBMS vs. Computer Architecture
- Vectorizing the Volcano Iterator Model
- Why & how vectorized primitives make a CPU happy
- Evaluation: TPC-H SF=100, ...x faster than DB2 (?)
- The rest of the system
- Conclusion & Future Work

Motivation
Application areas:
- OLAP, data warehousing
- Data mining in the DBMS
- Multimedia retrieval
- Scientific data (astro, bio, ...)
Challenge: process really large datasets within the DBMS efficiently

Research Area: Database Architecture
DBMS design, implementation, evaluation vs. Computer Architecture
- Data structures
- Query processing algorithms
MonetDB (monetdb.cwi.nl) at CWI
Now: MonetDB/X100

Scalar -> Super-Scalar, "Pipelining" -> "Hyper-Pipelining"

CPU: From CISC to hyper-pipelined
- 1986: 8086: CISC
- 1990: 486: 2 execution units
- 1992: Pentium: 2 x 5-stage pipelined units
- 1996: Pentium3: 3 x 7-stage pipelined units
- 2000: Pentium4: 12 x 20-stage pipelined execution units
Each instruction executes in multiple steps: A -> A1, ..., An ... in (multiple) pipelines
[pipeline diagram: instructions A..H progressing through the pipeline stages, one step per CPU clock cycle]

CPU
But only if the instructions are independent!
Problems:
- branches in program logic
- instructions that depend on each other's results
[ailamaki99, trancoso98, ...] -> DBMSs are bad at filling pipelines
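
A hedged illustration (not from the slides) of the branch problem in a selection loop; the column name and predicate are invented for the example. The branching version must predict a data-dependent branch, so mispredictions flush the pipeline; the branch-free version keeps the pipeline full.

/* Both loops collect the positions i where age[i] > 25.
   out must have room for n entries. */
int sel_branch(int *out, const int *age, int n) {
    int k = 0;
    for (int i = 0; i < n; i++) {
        if (age[i] > 25)          /* data-dependent branch: hard to predict */
            out[k++] = i;
    }
    return k;
}

int sel_nobranch(int *out, const int *age, int n) {
    int k = 0;
    for (int i = 0; i < n; i++) {
        out[k] = i;               /* always write the candidate position */
        k += (age[i] > 25);       /* advance the cursor only if selected */
    }
    return k;
}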

Volcano Refresher: Query

SELECT name, salary*.19 AS tax
FROM employee
WHERE age > 25

Volcano Refresher: Operators
Iterator interface:
- open()
- next(): tuple
- close()
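
A minimal sketch (not from the slides) of the open/next/close iterator interface in C; the struct and field names are invented for illustration.

typedef struct Tuple Tuple;                  /* one row, opaque here */

typedef struct Operator {
    void   (*open)(struct Operator *self);
    Tuple *(*next)(struct Operator *self);   /* returns NULL when exhausted */
    void   (*close)(struct Operator *self);
    struct Operator *child;                  /* input operator, if any */
    void   *state;                           /* operator-private state */
} Operator;

/* The root of the plan drives execution one tuple at a time. */
static void run(Operator *root) {
    root->open(root);
    for (Tuple *t = root->next(root); t != NULL; t = root->next(root)) {
        /* emit t to the client */
    }
    root->close(root);
}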

Volcano Refresher: Primitives
Provide computational functionality
All arithmetic allowed in expressions, e.g. multiplication
mult(int,int) -> int

Tuple-at-a-time Primitives
*(int,int): int

void mult_int_val_int_val(int *res, int l, int r) {
    *res = l * r;
}

LOAD  reg0, (l)
LOAD  reg1, (r)
MULT  reg0, reg1
STORE reg0, (res)

15 cycles per tuple + function call cost (~20 cycles)
Total: ~35 cycles per tuple
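
A hedged sketch (not from the slides) of how a tuple-at-a-time evaluator ends up paying that call overhead once per tuple; the function-pointer indirection stands in for the interpreter's per-tuple dispatch.

typedef void (*prim_fn)(int *res, int l, int r);

/* One indirect call per tuple: the ~20-cycle call cost is paid n times. */
void eval_mult_per_tuple(prim_fn mult, int *res, const int *l, const int *r, int n) {
    for (int i = 0; i < n; i++)
        mult(&res[i], l[i], r[i]);
}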

Vectors
Column slices as unary arrays
NOT the claim: vertical is a better table storage layout than horizontal (though we still think it often is)
RATIONALE:
- Primitives see only the relevant columns, not tables
- Simple array operations are well supported by compilers
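
A minimal sketch (an assumption, not the X100 source) of a vector as a small array of values plus a count; the names and the size constant are invented.

enum { VECTOR_SIZE = 1024 };          /* tuned so all vectors fit the CPU cache */

typedef struct int_vector {
    int n;                            /* number of valid values (<= VECTOR_SIZE) */
    int values[VECTOR_SIZE];          /* contiguous slice of one column */
} int_vector;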

x100: Vectorized Primitives
*(int,int): int  ->  *(int[],int[]): int[]

void map_mult_int_col_int_col(
        int *__restrict__ res,
        int *__restrict__ l,
        int *__restrict__ r,
        int n)
{
    for (int i = 0; i < n; i++)
        res[i] = l[i] * r[i];
}

Pipelinable loop

x100: Vectorized Primitives
Pipelined loop, generated by the C compiler:

LOAD  reg0, (l+0)
LOAD  reg1, (r+0)
LOAD  reg2, (l+1)
LOAD  reg3, (r+1)
LOAD  reg4, (l+2)
LOAD  reg5, (r+2)
MULT  reg0, reg1
MULT  reg2, reg3
MULT  reg4, reg5
STORE reg0, (res+0)
STORE reg2, (res+1)
STORE reg4, (res+2)

x100: Vectorized Primitives
Estimated throughput (steady state, one issue group per cycle):

LOAD reg8, (l+4)   LOAD reg9, (r+4)   MULT reg4, reg5   STORE reg0, (res+0)
LOAD reg0, (l+5)   LOAD reg1, (r+5)   MULT reg6, reg7   STORE reg2, (res+1)
LOAD reg2, (l+6)   LOAD reg3, (r+6)   MULT reg8, reg9   STORE reg4, (res+2)

2 cycles per tuple
1 function call (~20 cycles) per vector (i.e. 20/100)
Total: 2.2 cycles per tuple
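
A hedged sketch (not the X100 source) of the vector-at-a-time driver around the primitive defined two slides up: the primitive is called once per vector, so with 100 values per vector the ~20-cycle call cost adds only ~0.2 cycles per tuple.

/* Hypothetical driver; assumes map_mult_int_col_int_col from the slide above. */
void eval_mult_per_vector(int *res, int *l, int *r, int total, int vector_size) {
    for (int off = 0; off < total; off += vector_size) {
        int n = (total - off < vector_size) ? (total - off) : vector_size;
        /* one function call amortized over n tuples */
        map_mult_int_col_int_col(res + off, l + off, r + off, n);
    }
}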

Memory Hierarchy
Vectors are only the in-cache representation
RAM & disk representation might actually be different (we use both PAX and DSM)
[diagram: the X100 query engine operates on the CPU cache; ColumnBM (buffer manager) manages RAM; below that, (RAID) disks and networked ColumnBM-s]
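
A hedged sketch (invented field names and block sizes, not the actual ColumnBM formats) contrasting a DSM-style and a PAX-style disk block for two int columns a and b.

/* DSM: each block holds values of a single column. */
typedef struct dsm_block {
    int col_id;          /* which column this block stores */
    int n;               /* number of valid values */
    int values[1020];    /* illustration size only */
} dsm_block;

/* PAX: one block holds a horizontal partition of the table, but inside the
   block each column's values are contiguous ("mini-pages"), so a scan still
   reads each column sequentially. */
typedef struct pax_block {
    int n;
    int a_values[510];   /* mini-page for column a */
    int b_values[510];   /* mini-page for column b */
} pax_block;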

x100 result (TPC-H Q1): as predicted, with very low cycles per tuple

MySQL (TPC-H Q1)
One-tuple-at-a-time processing
Compared with x100:
- more instructions per tuple (and even more cycles per tuple)
- lots of "overhead": tuple navigation / movement, expensive hashing
- NOT: locking

Optimal vector size?
All vectors together should fit in the CPU cache
The optimizer should tune this, given the query characteristics
[same memory-hierarchy diagram: X100 engine on the CPU cache, ColumnBM on RAM, disks / networked ColumnBM-s below]
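
A hedged back-of-the-envelope sketch (not from the slides): with an assumed 512 KB cache budget and, say, 8 live int columns in the plan, keeping all vectors cache-resident bounds the vector size by roughly cache_size / (live_columns * value_width).

#include <stdio.h>

int main(void) {
    int cache_bytes  = 512 * 1024;   /* assumed cache budget */
    int live_columns = 8;            /* vectors alive at once in the query plan */
    int value_width  = 4;            /* 4-byte ints */
    int vector_size  = cache_bytes / (live_columns * value_width);
    printf("vector size <= %d values per vector\n", vector_size);  /* 16384 here */
    return 0;
}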

Vector size impact
Varying the vector size on TPC-H query 1
[chart: vector size 1 behaves like tuple-at-a-time systems (MySQL, Oracle, DB2) with low IPC and interpretation overhead; very large vectors behave like MonetDB/MIL and become RAM-bandwidth bound; X100 is fastest in between, with cache-resident vectors]

MonetDB/MIL materializes columns
[diagram: MonetDB/MIL operates on fully materialized columns in RAM, whereas MonetDB/X100 operates on vectors in the CPU cache, with ColumnBM, disks and networked ColumnBM-s below]

How much faster is it? X100 vs DB2 official TPC-H numbers (SF=100)

Is it really? X100 vs DB2 official TPC-H numbers (SF=100)
Small print:
- assumes perfect 4-CPU scaling for DB2
- X100 numbers are a hot run; DB2 has I/O
- but DB2 has 112 SCSI disks and we have just 1

Now: ColumnBM
A buffer manager for MonetDB: scale out of main memory
Ideas:
- use large chunks (>1 MB) for sequential bandwidth
- differential lists for updates -> applied only in the CPU cache (per vector)
- vertical fragments are immutable objects -> nice for compression, no index maintenance
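
A hedged sketch (invented names, not ColumnBM code) of applying a differential update list to a vector as it enters the CPU cache: the immutable on-disk fragment is patched per vector instead of being rewritten in place.

typedef struct diff_entry { long pos; int new_value; } diff_entry;

/* vec holds n values starting at global position 'base'.
   diffs[0..m) is sorted by pos; start is the first entry not yet consumed.
   Returns the index of the first diff beyond this vector. */
int apply_diffs(int *vec, int n, long base,
                const diff_entry *diffs, int m, int start) {
    int d = start;
    while (d < m && diffs[d].pos < base + n) {
        if (diffs[d].pos >= base)
            vec[diffs[d].pos - base] = diffs[d].new_value;
        d++;
    }
    return d;
}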

Problem: bandwidth. x100 is too fast for the disk (~600 MB/s on TPC-H Q1)

ColumnBM: Boosting Bandwidth
Throw everything at this problem:
- vertical fragmentation: don't access what you don't need
- use network bandwidth: replicate blocks on other nodes running ColumnBM
- lightweight compression, at rates of >1 GB/second
- re-use bandwidth if multiple concurrent queries want overlapping data
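
A hedged illustration of one kind of lightweight compression, a simple frame-of-reference scheme chosen by the editor as an example (the actual X100 schemes are not shown in these slides): values are stored as small deltas from a per-block base, and decompression is a tight, pipelinable loop.

#include <stdint.h>

/* Hypothetical frame-of-reference decode: each compressed value is an
   unsigned 8-bit delta from the block's base value. */
void for_decode(int *out, const uint8_t *deltas, int base, int n) {
    for (int i = 0; i < n; i++)
        out[i] = base + deltas[i];   /* branch-free, vectorizable by the compiler */
}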

Summary
Goal: CPU efficiency on analysis applications
Main idea: vectorized processing
Compared with a classical RDBMS:
- the C compiler can generate pipelined loops
- reduced interpretation overhead
Compared with MonetDB/MIL:
- uses less bandwidth -> better I/O-based scalability

Conclusion
New engine for MonetDB (monetdb.cwi.nl)
Promising first results
Scaling to huge (disk-based) data sets
Future work:
- vectorizing more query processing algorithms
- JIT primitive compilation
- lightweight compression
- re-using I/O