1 C-Store: A Column-oriented DBMS
New England Database Group (Stonebraker, et al. Brandeis/Brown/MIT/UMass-Boston)
Extended for Big Data Reading Group
Presentation by Shimin Chen

2 M.I.T Relational Database
[Figure: a table with rows Record 1, Record 2, Record 3 and columns Attribute1, Attribute2, Attribute3]
e.g.
Customer(cid, name, address, discount)
Product(pid, name, manufacturer, price, quantity)
Order(oid, cid, pid, quantity)

3 Current DBMS -- “Row Store”
[Figure: Record 1, Record 2, Record 3, Record 4]
E.g. DB2, Oracle, Sybase, SQLServer, …

4 Row Stores are Write Optimized (use white board)
• Store fields in one record contiguously on disk
• Use small (e.g. 4K) disk blocks
• Use B-tree indexing
• Align fields on byte or word boundaries
• Assume shifting data values is costly
• Transactions: write-ahead logging

5 Row Stores are Write Optimized
• Can insert and delete a record in one physical write
• Good for on-line transaction processing (OLTP)
• But not for read-mostly applications
– Data warehouses
– Customer Relationship Management (CRM)
– Electronic library card catalogs
– …

6 Column Stores

7 At 100K Feet….
• Read-optimized:
– Periodically a bulk load of new data
– Long period of ad-hoc queries
• Benefit:
– Ad-hoc queries read 2 columns out of 20
– Column store reads 10% of what a row store reads
• Previous pioneering work:
– Sybase IQ (early ’90s)
– Monet (see CIDR ’05 for the most recent description)
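The 10% figure on this slide is simple arithmetic. A minimal sketch, assuming all columns have equal width and are uncompressed (a simplification that the slide does not state); the function name is made up for illustration:

```python
def fraction_of_bytes_read(columns_needed: int, total_columns: int) -> float:
    """Share of a row store's I/O that a column store performs.

    Assumes equal-width, uncompressed columns, so bytes read are
    proportional to the number of columns touched.
    """
    if not 0 < columns_needed <= total_columns:
        raise ValueError("columns_needed must be between 1 and total_columns")
    return columns_needed / total_columns


# The slide's case: a query that touches 2 of 20 columns.
ratio = fraction_of_bytes_read(columns_needed=2, total_columns=20)
print(f"{ratio:.0%}")  # 10%
```

With compression, the real ratio can differ from this estimate in either direction, depending on which columns the query reads.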

8 C-Store Technical Ideas
• Data storage:
• Only materialized views (perhaps many)
• Compress the columns to save space
• No alignment
• Big disk blocks
• Innovative redundancy
• Optimize for grid (cluster) computing
• Focus on Sorting not indexing
• Automatic physical DBMS design
• Column optimizer and executor

9 How to Evaluate This Paper….
• None of the ideas in isolation merit publication
• Judge the complete system by its (hopefully intelligent) choice of
– Small collection of inter-related powerful ideas
– That together put performance in a new sandbox

10 Outline
• Overview
• Read-optimized column store
• Query execution and optimization
• Handling transactional updates
• Performance
• Summary

11 Data Model
• Projection (materialized view):
– some number of columns from a fact table
– plus columns in a dimension table, with a 1-n join between Fact and Dimension table
– (conceptually) no duplicate elimination
– Stored in order of a storage key(s)
• Note: base table is not stored anywhere

12 Example
• Logical base tables:
– EMP (name, age, salary, dept)
– DEPT (dname, floor)
• Example projections
– EMP1 (name, age | age)
– EMP2 (dept, age, DEPT.floor | DEPT.floor)
– EMP3 (name, salary | salary)
– DEPT1 (dname, floor | floor)
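To make the notation concrete, here is a small sketch of how EMP1 and EMP2 could be materialized from the logical tables. The sample rows and the helper `make_projection` are invented for illustration; the slide gives only the schemas, and this is not the system's actual loader.

```python
# Hypothetical sample rows for the logical base tables.
EMP = [
    {"name": "Ann", "age": 31, "salary": 50, "dept": "Sales"},
    {"name": "Bob", "age": 25, "salary": 70, "dept": "Eng"},
    {"name": "Cy", "age": 28, "salary": 60, "dept": "Eng"},
]
DEPT = [
    {"dname": "Sales", "floor": 2},
    {"dname": "Eng", "floor": 1},
]


def make_projection(rows, columns, sort_key):
    """Return {column: values}, with every column stored in sort_key order."""
    ordered = sorted(rows, key=lambda r: r[sort_key])
    return {c: [r[c] for r in ordered] for c in columns}


# EMP1 (name, age | age): two EMP columns, sorted on age.
emp1 = make_projection(EMP, ["name", "age"], "age")

# EMP2 (dept, age, DEPT.floor | DEPT.floor): EMP columns plus a column
# pulled in from DEPT through the dept -> dname relationship.
floor_of = {d["dname"]: d["floor"] for d in DEPT}
joined = [
    {"dept": e["dept"], "age": e["age"], "DEPT.floor": floor_of[e["dept"]]}
    for e in EMP
]
emp2 = make_projection(joined, ["dept", "age", "DEPT.floor"], "DEPT.floor")

print(emp1)
print(emp2)
```

Each projection keeps one list per column, and all lists in a projection share the same order, which is what lets a position identify a record.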

13 Optimize for Grid Computing
• I.e. shared-nothing
• Horizontal partitioning and intra-query parallelism as in Gamma
• Paper talks about “Grid computers … may have tens to hundreds of nodes …”

14 Projection Detail #1
• Each projection is horizontally partitioned into “segment”s
– Segment identifier
– Unit of distribution and parallelism
– Value-based partitioning, key range of sort key(s)
• Column-wise store inside segment
• Storage key: ordinal record number in segment
– calculated as needed
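A rough sketch of value-based partitioning and implicit storage keys follows. The split points, the function names, and the choice of 1-based storage keys are assumptions made for this example.

```python
from bisect import bisect_right


def partition_into_segments(sorted_keys, boundaries):
    """Split sorted sort-key values into value-range segments.

    boundaries are hypothetical split points: with [b0, b1], segment 0
    holds keys < b0, segment 1 holds b0 <= key < b1, segment 2 the rest.
    """
    segments = [[] for _ in range(len(boundaries) + 1)]
    for key in sorted_keys:
        segments[bisect_right(boundaries, key)].append(key)
    return segments


def storage_keys(segment):
    """Storage keys are not stored; they are ordinal positions (1-based here)."""
    return list(range(1, len(segment) + 1))


ages = [24, 25, 27, 31, 40, 52]          # a column already in sort-key order
segments = partition_into_segments(ages, boundaries=[27, 45])
for segment_id, segment in enumerate(segments):
    print(segment_id, segment, storage_keys(segment))
```

Because the storage key is just a position, it costs no space and can be recomputed whenever an operator needs it.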

15 Projection Detail #2
• Different encoding schemes for different columns
• Depends on ordering and value distribution
– Self-order, few distinct values: (value, position, num_entries)
– Foreign-order, few distinct values: (value, bitmap), bitmap is run-length encoded
– Self-order, many distinct values: block-oriented, delta value encoding
– Foreign-order, many distinct values: gzip
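The first three encodings can be sketched in a few lines. This is an illustration only: positions are 0-based here (an arbitrary choice), bitmaps are left as plain bit lists rather than run-length encoded, and block boundaries are ignored.

```python
from itertools import groupby


def rle_encode(column):
    """Self-order, few distinct values: (value, first_position, num_entries)."""
    triples, position = [], 0
    for value, run in groupby(column):
        count = len(list(run))
        triples.append((value, position, count))
        position += count
    return triples


def rle_decode(triples):
    out = []
    for value, _position, count in triples:
        out.extend([value] * count)
    return out


def bitmap_encode(column):
    """Foreign-order, few distinct values: one bit list per distinct value."""
    return {v: [1 if x == v else 0 for x in column] for v in sorted(set(column))}


def delta_encode(block):
    """Self-order, many distinct values: first value, then differences."""
    return block[:1] + [b - a for a, b in zip(block, block[1:])]


def delta_decode(deltas):
    out = []
    for d in deltas:
        out.append(d if not out else out[-1] + d)
    return out


print(rle_encode([4, 4, 4, 4, 7, 7, 9]))
print(bitmap_encode([0, 0, 1, 1, 2, 1, 0, 2, 1]))
print(delta_encode([1, 4, 7, 7, 8, 12]))
```

Applying `rle_encode` to each bit list from `bitmap_encode` would give the run-length encoded bitmap that the slide mentions for the second case.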

16 Different Indexing
Ordering | Few values | Many values
Sequential (self-order) | RLE encoded; Conventional B-tree at the value level | Delta encoded; Conventional B-tree at the block level
Non sequential (foreign-order) | Bitmap per value | Conventional Gzip; Conventional B-tree at the block level

17 Big Disk Blocks
• Tunable
• Big (minimum size is 64K)

18 Reconstructing Base Table from Projections
• Join Index:
– Projection T1 has M segments, projection T2 has N segments
– T1 and T2 are on same base table
– Join index consists of M tables, one per T1 segment
– Entry: segment ID and storage key of corresponding record in T2
• In general, needs multiple join indices for reconstructing a base table
• Join index is costly to store and maintain
– Each column expected to be in multiple projections
– Reduce # of join indices
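The join index described on this slide can be sketched as follows. The base-row identifiers used to build the index are a device for this example only (the base table itself is never stored), and both function names are hypothetical.

```python
def build_join_index(t1_segments, t2_segments):
    """Build one table per T1 segment.

    Each segment is a list of base-row ids in that projection's sort order.
    Entry i of a table is (T2 segment id, T2 storage key) for the i-th
    record of the matching T1 segment. Storage keys are 1-based here.
    """
    where_in_t2 = {}
    for segment_id, segment in enumerate(t2_segments):
        for storage_key, row_id in enumerate(segment, start=1):
            where_in_t2[row_id] = (segment_id, storage_key)
    return [[where_in_t2[row_id] for row_id in segment] for segment in t1_segments]


def fetch_via_join_index(join_index, t2_column_segments):
    """Reorder one T2 column into T1's order by following the join index."""
    return [
        [t2_column_segments[segment_id][key - 1] for segment_id, key in table]
        for table in join_index
    ]


# T1 is sorted on age and split into M = 2 segments;
# T2 is sorted on salary and has N = 1 segment.
t1 = [["bob", "cy"], ["ann"]]
t2 = [["ann", "cy", "bob"]]
t2_salary = [[50, 60, 70]]

join_index = build_join_index(t1, t2)
print(join_index)
print(fetch_via_join_index(join_index, t2_salary))
```

The index holds one entry per record, which illustrates why the slide calls join indices costly to store and maintain.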

19 Innovative Redundancy
• Hardly any warehouse is recovered by redo from log
• Takes too long!
• Store enough projections to ensure K-safety
• Column can be in K different projections
• Rebuild dead objects from elsewhere in the network

20 Automatic Physical DBMS Design
• Accept a “training set” of queries and a space budget
• Choose the projections and join indices auto-magically
• Re-optimize periodically based on a log of the interactions

21 Outline
• Overview
• Read-optimized column store
• Query execution and optimization
• Handling transactional updates
• Performance
• Summary

22 Operators
• Decompress
• Select: generate bitstring
• Mask: bitstring+projection → selected rows
• Project: choose a subset of columns
• Concat: combine multiple projections that are sorted in the same order
• Sort
• Permute: according to a join index
• Join
• Aggregation operators
• Bitstring operators
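A small sketch of how Select, a bitstring operator, and Mask might fit together. Bitstrings are plain Python lists of 0 and 1 here, and the sample projection is invented; a real executor would work on compressed blocks.

```python
def select(column, predicate):
    """Select: produce a bitstring with a 1 at every qualifying position."""
    return [1 if predicate(v) else 0 for v in column]


def bit_and(a, b):
    """Bitstring operator: combine two selections position by position."""
    return [x & y for x, y in zip(a, b)]


def mask(bitstring, projection):
    """Mask: keep only the rows of the projection whose bit is 1."""
    return {
        name: [v for v, bit in zip(values, bitstring) if bit]
        for name, values in projection.items()
    }


# Hypothetical projection with columns stored in the same order.
emp = {"name": ["Bob", "Cy", "Ann"], "age": [25, 28, 31], "salary": [70, 60, 50]}

older = select(emp["age"], lambda age: age > 26)         # [0, 1, 1]
well_paid = select(emp["salary"], lambda s: s >= 60)     # [1, 1, 0]
both = bit_and(older, well_paid)                         # [0, 1, 0]
print(mask(both, emp))
```

Deferring Mask until after the bitstrings are combined means only the rows that survive every predicate are materialized.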

23 Execution
• Query plan: a tree of operators (data flow)
– Leaf: accessing the data storage
– Internal: calls “get_next”
• Operators return 64KB blocks
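The pull-based plan on this slide can be sketched with three toy operators. The class names are made up, and blocks hold two values instead of 64KB so the example stays readable.

```python
class Scan:
    """Leaf operator: reads a stored column block by block."""

    def __init__(self, column, block_size):
        self.column, self.block_size, self.pos = column, block_size, 0

    def get_next(self):
        if self.pos >= len(self.column):
            return None  # no more data
        block = self.column[self.pos:self.pos + self.block_size]
        self.pos += self.block_size
        return block


class Filter:
    """Internal operator: pulls blocks from its child with get_next."""

    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def get_next(self):
        block = self.child.get_next()
        while block is not None:
            kept = [v for v in block if self.predicate(v)]
            if kept:
                return kept
            block = self.child.get_next()
        return None


class Sum:
    """Internal operator: drains its child and returns a single-value block."""

    def __init__(self, child):
        self.child, self.done = child, False

    def get_next(self):
        if self.done:
            return None
        total = 0
        block = self.child.get_next()
        while block is not None:
            total += sum(block)
            block = self.child.get_next()
        self.done = True
        return [total]


plan = Sum(Filter(Scan([5, 12, 7, 20, 3, 15], block_size=2), lambda v: v >= 10))
print(plan.get_next())  # [47]
```

The root drives execution: each call to `get_next` on `plan` pulls blocks up through the tree from the leaf.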

24 Column Optimizer (discussion)
• Cost-based estimation for query plan construction
• Chooses projections on which to run the query
• Cost model includes compression types
• When to perform “mask” operator
• Build in snowflake schemas
• Which are simple to optimize without exhaustive search
• Looking at extensions

25 Outline
• Overview
• Read-optimized column store
• Query execution and optimization
• Handling transactional updates
• Performance
• Summary

26 Online Updates Are Necessary
• Transactional updates are necessary even in a read-mostly environment
• Online updates for error corrections
• Real-time data warehouses
– Reduce the delay between OLTP system and warehouse towards zero

27 Solution – a Hybrid Store
[Figure: Write-optimized Column store → Tuple mover (Batch rebuilder) → Read-optimized Column store (What we have been talking about so far)]

28 Write Store
• Column store
• Horizontally partitioned as the read store
– 1:1 mapping between RS segments and WS segments
• Storage keys are explicitly stored
– Btree: sort key → storage key
• No compression (the data size is small)

29 Handling Updates
• Optimize read-only query: do not hold locks
– Snapshot isolation
– The query is run on a snapshot of the data
– Ensure transactions related to this snapshot have already committed
• Each WS site: insertion vector (with timestamps), deletion vector (updates become insertions and deletions)
• Maintain a high water mark and a low water mark of WS sites:
– HWM: all transactions before HWM have committed
– LWM: no records in read store are inserted before LWM
• Queries can specify a time before HWM
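A minimal sketch of how a lock-free read could decide visibility from the insertion and deletion vectors. The dictionary layout, the function names, and the exact boundary rule (inserted at or before the query epoch, not deleted at or before it) are assumptions for illustration.

```python
def visible(storage_key, query_epoch, insertion_vector, deletion_vector):
    """True if the record existed in the snapshot taken at query_epoch."""
    inserted = insertion_vector[storage_key]
    deleted = deletion_vector.get(storage_key)  # None if never deleted
    return inserted <= query_epoch and (deleted is None or deleted > query_epoch)


def snapshot(query_epoch, hwm, insertion_vector, deletion_vector):
    """Storage keys visible to a read-only query; no locks are taken."""
    if query_epoch > hwm:
        raise ValueError("query epoch is past the high water mark")
    return sorted(
        key for key in insertion_vector
        if visible(key, query_epoch, insertion_vector, deletion_vector)
    )


# Record 1 is inserted in epoch 1 and updated in epoch 3; the update
# becomes a deletion of record 1 plus an insertion of record 3.
insertion_vector = {1: 1, 2: 2, 3: 3}
deletion_vector = {1: 3}

print(snapshot(2, hwm=3, insertion_vector=insertion_vector,
               deletion_vector=deletion_vector))  # [1, 2]
print(snapshot(3, hwm=3, insertion_vector=insertion_vector,
               deletion_vector=deletion_vector))  # [2, 3]
```

Because a query only looks at epochs at or before the HWM, every transaction it can observe has already committed, so the read never has to wait on writers.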

30 HWM and epochs
• TA: time authority updates the coarse timer (epochs)

31 Transactions
• Undo from a log (that does not need to be persistent)
• Redo by rebuild from elsewhere in the network

32 Tuple-Mover
• Read RS segment
• Combine WS segment into a new version of the RS segment, do not update in place
• Record last move time for this segment in WS
• T_last_move ≤ LWM
• Time authority periodically sends out a new LWM epoch number
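A simplified sketch of one tuple-mover pass. It assumes that WS records inserted at or before the LWM are the ones eligible to move, ignores deletions, and treats a record as just its sort-key value; all names are hypothetical.

```python
import heapq


def move_tuples(rs_segment, ws_records, lwm):
    """Merge eligible WS records into a new version of an RS segment.

    rs_segment: sort-key values, already sorted (left untouched).
    ws_records: list of (sort_key, insertion_epoch) pairs.
    Returns (new_rs_segment, remaining_ws_records, t_last_move).
    """
    movable = sorted(key for key, epoch in ws_records if epoch <= lwm)
    remaining = [(key, epoch) for key, epoch in ws_records if epoch > lwm]
    # Build a new segment instead of updating the old one in place.
    new_rs_segment = list(heapq.merge(rs_segment, movable))
    moved_epochs = [epoch for _, epoch in ws_records if epoch <= lwm]
    t_last_move = max(moved_epochs, default=None)
    return new_rs_segment, remaining, t_last_move


rs = [10, 30, 50]
ws = [(20, 4), (40, 5), (25, 7)]
new_rs, ws_left, t_last_move = move_tuples(rs, ws, lwm=5)
print(new_rs)       # [10, 20, 30, 40, 50]
print(ws_left)      # [(25, 7)]
print(t_last_move)  # 5
```

Writing a fresh segment rather than editing in place lets queries that are still running keep reading the old version until the new one is complete.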

33 Current Performance
Varying storage:
• 100X popular row store in 40% of the space
• 10X popular column store in 70% of the space
• 7X popular row store in 1/6th of the space
• Code available with BSD license

34 Summary
• Column store is optimized for read queries
• Cluster parallelism
• Interesting data organization
• Handling write queries

