Optimizing HBase scanner performance Mikhail Bautin Software Engineer 01/19/2012.

2 Optimizing HBase scanner performance Mikhail Bautin Software Engineer 01/19/2012

3 HBase Scanners
What happens on a Get
Diagram labels: RegionScanner; StoreScanner; StoreFileScanner; ColumnFamily1, ColumnFamily2, ...; Store = (Region, CF)
Example KeyValues: (R1,C1,T3) (R1,C2,T2) (R1,C2,T1) (R1,C1,T1) (R1,C2,T3) (R2,C1,T2) (R2,C2,T1) ...

4 HBase Scanner State
What happens on a next()
Diagram labels: RegionScanner; StoreScanner; StoreFileScanner; ColumnFamily1, ColumnFamily2, ...; Store = (Region, CF); Current KeyValue; Priority Queue
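The priority-queue merge on this slide can be illustrated with a minimal sketch. It is not HBase code: the `KV` and `FileScanner` classes and the `merge` method are hypothetical stand-ins, and the only assumption carried over is the sort order (row, then column, then newest timestamp first).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class ScannerHeapSketch {
    // Hypothetical stand-in for an HBase KeyValue: (row, column, timestamp).
    static final class KV {
        final String row, col;
        final long ts;
        KV(String row, String col, long ts) { this.row = row; this.col = col; this.ts = ts; }
        @Override public String toString() { return "(" + row + ", " + col + ", T" + ts + ")"; }
    }

    // Rows and columns ascend; timestamps descend, so newer versions sort first.
    static int compare(KV a, KV b) {
        int c = a.row.compareTo(b.row);
        if (c == 0) c = a.col.compareTo(b.col);
        if (c == 0) c = Long.compare(b.ts, a.ts);
        return c;
    }

    // Scanner over one sorted file; "current" is the KeyValue it is positioned on.
    static final class FileScanner {
        private final Iterator<KV> it;
        KV current;
        FileScanner(List<KV> sortedFile) { it = sortedFile.iterator(); next(); }
        void next() { current = it.hasNext() ? it.next() : null; }
    }

    // Merge sorted files by repeatedly taking the scanner with the smallest current key.
    static List<KV> merge(List<List<KV>> files) {
        PriorityQueue<FileScanner> heap =
                new PriorityQueue<>((a, b) -> compare(a.current, b.current));
        for (List<KV> file : files) {
            FileScanner s = new FileScanner(file);
            if (s.current != null) heap.add(s);
        }
        List<KV> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            FileScanner top = heap.poll();
            out.add(top.current);
            top.next();                              // in a real store this step may cost disk I/O
            if (top.current != null) heap.add(top);  // re-insert at its new position
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<KV>> files = Arrays.asList(
                Arrays.asList(new KV("R1", "C1", 2), new KV("R1", "C1", 1), new KV("R2", "C1", 1)),
                Arrays.asList(new KV("R1", "C1", 3), new KV("R2", "C1", 2)));
        System.out.println(merge(files));
    }
}
```

Every `next()` on the top scanner is followed by a re-insert into the heap, which is why the later slides focus on avoiding unnecessary `next()` calls.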

5 Avoiding next() on StoreFileScanner
Every next() call may result in disk I/O
▪ HBASE-4433: avoid extra next if done with row/column (Kannan)
▪ An optimization for queries specifying a column set
▪ INCLUDE_AND_SEEK_NEXT_COL
▪ INCLUDE_AND_SEEK_NEXT_ROW
▪ HBASE-4434: Don't do HFile Scanner next() unless the next KV is needed (Kannan)
▪ Avoid aggressive pre-fetching
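A rough sketch of the idea behind the two combined codes follows. The `match` method and its parameters are hypothetical, and the `INCLUDE` and `SKIP` values are added here only to make the example complete; the point is that a single return value both includes the current KV and says where to seek, so no extra next() is needed to discover that the column or row is finished.

```java
import java.util.Arrays;
import java.util.TreeSet;

class MatchCodeSketch {
    // INCLUDE_AND_SEEK_NEXT_COL / INCLUDE_AND_SEEK_NEXT_ROW are the codes named on the slide.
    enum MatchCode { INCLUDE, SKIP, INCLUDE_AND_SEEK_NEXT_COL, INCLUDE_AND_SEEK_NEXT_ROW }

    // versionsSeen counts the versions of this column returned so far, including this KV.
    static MatchCode match(String column, int versionsSeen, int maxVersions,
                           TreeSet<String> requested) {
        if (!requested.contains(column)) return MatchCode.SKIP;
        if (versionsSeen < maxVersions) return MatchCode.INCLUDE;  // more versions still wanted
        // This KV completes the column: include it and tell the caller where to jump next.
        return column.equals(requested.last())
                ? MatchCode.INCLUDE_AND_SEEK_NEXT_ROW
                : MatchCode.INCLUDE_AND_SEEK_NEXT_COL;
    }

    public static void main(String[] args) {
        TreeSet<String> requested = new TreeSet<>(Arrays.asList("C1", "C3"));
        System.out.println(match("C1", 1, 1, requested));
        System.out.println(match("C3", 1, 1, requested));
    }
}
```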

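6 Simple ROWCOL Bloom Filters
Do we have to read all of these files?
File 1 (Row, Col, TS): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Query: (R1, C3)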

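7 Simple ROWCOL Bloom Filters
In some cases, we only have to read one file
File 1 (Row, Col, TS): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Query: (R1, C3)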
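As an illustration of how a per-file (row, column) Bloom filter lets a Get skip files, here is a minimal sketch. The `RowColBloom` class, its sizing, and its double-hashing scheme are assumptions made for this example and are not the HBase implementation. A Bloom filter can report false positives but never false negatives, so a "no" answer is always safe to act on.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class RowColBloomSketch {
    static final class RowColBloom {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        RowColBloom(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        // Double hashing: the i-th probe is h1 + i * h2 (mod numBits).
        // The "/" separator assumes row and column names do not contain "/".
        private int probe(String row, String col, int i) {
            int h1 = (row + "/" + col).hashCode();
            int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
            return Math.floorMod(h1 + i * h2, numBits);
        }

        void add(String row, String col) {
            for (int i = 0; i < numHashes; i++) bits.set(probe(row, col, i));
        }

        // false means "definitely absent"; true means "possibly present".
        boolean mightContain(String row, String col) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(probe(row, col, i))) return false;
            }
            return true;
        }
    }

    // Indexes of the files whose filter does not rule out (row, col).
    static List<Integer> filesToRead(List<RowColBloom> filters, String row, String col) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < filters.size(); i++) {
            if (filters.get(i).mightContain(row, col)) out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        RowColBloom file2 = new RowColBloom(1024, 3);
        file2.add("R1", "C3");
        RowColBloom file3 = new RowColBloom(1024, 3);
        file3.add("R1", "C1");
        List<RowColBloom> filters = new ArrayList<>();
        filters.add(file2);
        filters.add(file3);
        System.out.println(filesToRead(filters, "R1", "C3"));
    }
}
```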

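8 Multi-column Bloom Filters (HBASE-2794)
ROWCOL Bloom filters for multi-column queries
File 1 (Row, Col, TS): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Query: C1 and C3 in all rows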

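9 Multi-column Bloom Filters (HBASE-2794)
ROWCOL Bloom filters for multi-column queries
File 1 (Row, Col, TS): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Query: C1 and C3 in all rows; seek to (R1, C1)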

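10 Multi-column Bloom Filters (HBASE-2794)
ROWCOL Bloom filters for multi-column queries
File 1 (Row, Col, TS): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Query: C1 and C3 in all rows; seek to (R1, C3)
Scanner state: Fake key: (R1, end of C3)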

11 Multi-column Bloom Filters (HBASE-2794)
ROWCOL Bloom filters for multi-column queries
File 1 (Row, Col, TS): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Query: C1 and C3 in all rows; seek to (R2, C1)
Scanner state: (R2, C1, T1); (R2, C1, T2) wins by timestamp

12 Multi-column Bloom Filters (HBASE-2794)
ROWCOL Bloom filters for multi-column queries
File 1 (Row, Col, TS): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Query: C1 and C3 in all rows; seek to (R2, C3)
Scanner state: (R2, C3, T1); Fake key: (R2, end of C3)
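The "fake key" step in these slides can be sketched as follows. This is an illustration under stated assumptions, not the HBASE-2794 code: `StoreFile`, `seek`, and the `fake` flag are hypothetical, and an exact `HashSet` stands in for the ROWCOL Bloom filter so the example is deterministic (a real Bloom filter may also answer "maybe" for absent keys). When the filter rules out (row, column), the scanner is positioned on a synthetic key that sorts after every real version of that column, and no disk seek is made.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class FakeKeySeekSketch {
    static final class KV {
        final String row, col;
        final long ts;
        final boolean fake;
        KV(String row, String col, long ts, boolean fake) {
            this.row = row; this.col = col; this.ts = ts; this.fake = fake;
        }
    }

    // Rows and columns ascend; timestamps descend, so newer versions sort first.
    static final Comparator<KV> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c == 0) c = a.col.compareTo(b.col);
        if (c == 0) c = Long.compare(b.ts, a.ts);
        return c;
    };

    static final class StoreFile {
        final TreeSet<KV> kvs = new TreeSet<>(ORDER);
        final Set<String> rowCols = new HashSet<>();  // exact stand-in for the ROWCOL Bloom filter
        int diskSeeks;

        StoreFile add(String row, String col, long ts) {
            kvs.add(new KV(row, col, ts, false));
            rowCols.add(row + "/" + col);
            return this;
        }

        KV seek(String row, String col) {
            if (!rowCols.contains(row + "/" + col)) {
                // Fake key (row, end of col): the oldest possible timestamp sorts last.
                return new KV(row, col, Long.MIN_VALUE, true);
            }
            diskSeeks++;  // only now do we touch the file
            return kvs.ceiling(new KV(row, col, Long.MAX_VALUE, false));
        }
    }

    public static void main(String[] args) {
        StoreFile file3 = new StoreFile().add("R1", "C1", 4).add("R1", "C2", 2).add("R2", "C1", 1);
        KV k = file3.seek("R1", "C3");
        System.out.println("fake=" + k.fake + ", diskSeeks=" + file3.diskSeeks);
    }
}
```

Because the fake key sorts after all real (row, column) versions and before the next column or row, the scanner heap still orders this file correctly relative to the others.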

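13 Lazy Seek (HBASE-4465)
Optimizing for reading recent data
File 1 (Row, Col, TS; time range T1 – T2): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS; time range T2 – T3): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS; time range T1 – T4): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Scanner state: Fake key: (R1, C1, T3); Fake key: (R1, C1, T2); Fake key: (R1, C1, T4)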

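14 Lazy Seek (HBASE-4465)
Optimizing for reading recent data
File 1 (Row, Col, TS; time range T1 – T2): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS; time range T2 – T3): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS; time range T1 – T4): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Scanner state: Fake key: (R1, C1, T3); Fake key: (R1, C1, T2); (R1, C1, T4)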

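15 Lazy Seek (HBASE-4465)
Optimizing for reading recent data
File 1 (Row, Col, TS; time range T1 – T2): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS; time range T2 – T3): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS; time range T1 – T4): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Scanner state: Fake key: (R1, C3, T3); Fake key: (R1, C3, T2); Fake key: (R1, C3, T4)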

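16 Lazy Seek (HBASE-4465)
Optimizing for reading recent data
File 1 (Row, Col, TS; time range T1 – T2): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS; time range T2 – T3): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS; time range T1 – T4): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Scanner state: Fake key: (R1, C3, T3); Fake key: (R1, C3, T2); (R2, C1, T1)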

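17 Lazy Seek (HBASE-4465)
Optimizing for reading recent data
File 1 (Row, Col, TS; time range T1 – T2): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS; time range T2 – T3): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS; time range T1 – T4): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Scanner state: (R1, C3, T2) is next; Fake key: (R1, C3, T2); (R2, C1, T1)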

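18 Lazy Seek (HBASE-4465)
Optimizing for reading recent data
File 1 (Row, Col, TS; time range T1 – T2): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS; time range T2 – T3): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS; time range T1 – T4): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Scanner state: Fake key: (R2, C1, T3), to be selected next; Fake key: (R2, C1, T2); (R2, C1, T1)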

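19 Lazy Seek (HBASE-4465)
Optimizing for reading recent data
File 1 (Row, Col, TS; time range T1 – T2): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS; time range T2 – T3): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS; time range T1 – T4): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Scanner state: (R2, C1, T2) wins by timestamp; Fake key: (R2, C1, T2); (R2, C1, T1)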

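20 Lazy Seek (HBASE-4465)
Optimizing for reading recent data
File 1 (Row, Col, TS; time range T1 – T2): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS; time range T2 – T3): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS; time range T1 – T4): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Scanner state: Fake key: (R2, C3, T3); Fake key: (R2, C3, T2); Fake key: (R2, C3, T4)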

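21 Lazy Seek (HBASE-4465)
Optimizing for reading recent data
File 1 (Row, Col, TS; time range T1 – T2): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS; time range T2 – T3): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS; time range T1 – T4): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Scanner state: Real seek to (R2, C3, T3); Fake key: (R2, C3, T2); EOF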

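22 Lazy Seek (HBASE-4465)
Optimizing for reading recent data
File 1 (Row, Col, TS; time range T1 – T2): (R1, C1, T2) (R1, C1, T1) (R1, C2, T1) (R2, C1, T1) (R2, C2, T2) (R2, C2, T1) (R2, C3, T1)
File 2 (Row, Col, TS; time range T2 – T3): (R1, C1, T3) (R1, C2, T3) (R1, C3, T2) (R2, C1, T2) (R2, C2, T3)
File 3 (Row, Col, TS; time range T1 – T4): (R1, C1, T4) (R1, C2, T2) (R2, C1, T1)
Scanner state: EOF; (R2, C3, T1); EOF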
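The lazy seek walkthrough above can be sketched in a few lines. This is a simplified model, not the HBASE-4465 code: `LazyScanner`, `requestSeek`, `enforceSeek`, and the rule that a real key wins a tie against an equal fake key are assumptions made for this example. Each file answers a seek with a fake key built from its own maximum timestamp, and a real (disk) seek happens only when a fake key reaches the top of the heap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

class LazySeekSketch {
    static final class KV {
        final String row, col;
        final long ts;
        KV(String row, String col, long ts) { this.row = row; this.col = col; this.ts = ts; }
        @Override public String toString() { return "(" + row + ", " + col + ", T" + ts + ")"; }
    }

    // Rows and columns ascend; timestamps descend, so newer versions sort first.
    static final Comparator<KV> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c == 0) c = a.col.compareTo(b.col);
        if (c == 0) c = Long.compare(b.ts, a.ts);
        return c;
    };

    static final class LazyScanner {
        final TreeSet<KV> file = new TreeSet<>(ORDER);
        long maxTs = Long.MIN_VALUE;  // newest timestamp stored in this file
        KV current;
        boolean fake;
        int realSeeks;

        LazyScanner add(String row, String col, long ts) {
            file.add(new KV(row, col, ts));
            maxTs = Math.max(maxTs, ts);
            return this;
        }
        // No I/O: nothing in this file can sort before (row, col, maxTs) within that column.
        void requestSeek(String row, String col) { current = new KV(row, col, maxTs); fake = true; }
        // The real seek: position on the first stored key at or after the fake key.
        void enforceSeek() { current = file.ceiling(current); fake = false; realSeeks++; }
    }

    static KV seek(List<LazyScanner> scanners, String row, String col) {
        PriorityQueue<LazyScanner> heap = new PriorityQueue<>((a, b) -> {
            int c = ORDER.compare(a.current, b.current);
            return c != 0 ? c : Boolean.compare(a.fake, b.fake);  // on a tie, real before fake
        });
        for (LazyScanner s : scanners) { s.requestSeek(row, col); heap.add(s); }
        while (!heap.isEmpty()) {
            LazyScanner top = heap.poll();
            if (!top.fake) return top.current;       // a real key on top is the answer
            top.enforceSeek();                       // otherwise pay for one real seek
            if (top.current != null) heap.add(top);  // null means EOF for this file
        }
        return null;
    }

    // The three files shown on the slides, with T1..T4 written as 1..4.
    static List<LazyScanner> slideFiles() {
        List<LazyScanner> s = new ArrayList<>();
        s.add(new LazyScanner().add("R1", "C1", 2).add("R1", "C1", 1).add("R1", "C2", 1)
                .add("R2", "C1", 1).add("R2", "C2", 2).add("R2", "C2", 1).add("R2", "C3", 1));
        s.add(new LazyScanner().add("R1", "C1", 3).add("R1", "C2", 3).add("R1", "C3", 2)
                .add("R2", "C1", 2).add("R2", "C2", 3));
        s.add(new LazyScanner().add("R1", "C1", 4).add("R1", "C2", 2).add("R2", "C1", 1));
        return s;
    }

    public static void main(String[] args) {
        List<LazyScanner> s = slideFiles();
        System.out.println(seek(s, "R1", "C1") + ", real seeks in file 1: " + s.get(0).realSeeks);
    }
}
```

In this model, seeking to (R1, C1) touches only the file with the newest data, which is the "reading recent data" case the slides optimize for.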

23 Top-of-the-row seek
Some applications do not use DeleteFamily
▪ We always seek to the top of the row first
▪ DeleteFamily comes before all columns, i.e. at (R1, empty column)
▪ Even if we only need (R1, C1), there might be a DeleteFamily for R1
▪ Some applications do not even use DeleteFamily
▪ Two fixes by Liyin Tang:
▪ Utilize existing ROWCOL Bloom filter (HBASE-4469)
▪ Added a separate ROW-only Bloom filter for DeleteFamily (HBASE-4532)

24 Seek on deleted KV (HBASE-4585)
What if the requested column has been deleted?
▪ We are requesting C1, C2, ..., Cn
▪ What if we see a delete marker for Ci?
▪ Previously, we would keep calling next()
▪ Now, we seek to the (i + 1)th requested column (also a fix by Liyin)
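The difference between the two behaviours can be sketched by counting how many cells each strategy touches within one row. The `Cell` class and both scan methods are hypothetical, the delete marker is assumed to cover every version of its column, and the seek is modelled as a free jump (a real seek would cost an index lookup), so the numbers are only illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class DeleteSeekSketch {
    static final class Cell {
        final String col;
        final long ts;
        final boolean deleteMarker;
        Cell(String col, long ts, boolean deleteMarker) {
            this.col = col; this.ts = ts; this.deleteMarker = deleteMarker;
        }
    }

    // Previous behaviour: step over every cell with next(), including deleted versions.
    static int scanWithNext(List<Cell> row, TreeSet<String> requested, List<Cell> out) {
        int touched = 0;
        String deletedCol = null;
        for (Cell c : row) {
            touched++;
            if (c.deleteMarker) { deletedCol = c.col; continue; }
            if (c.col.equals(deletedCol)) continue;  // a deleted version, read for nothing
            if (requested.contains(c.col)) out.add(c);
        }
        return touched;
    }

    // New behaviour: on a delete marker for Ci, jump to the next requested column.
    static int scanWithSeek(List<Cell> row, TreeSet<String> requested, List<Cell> out) {
        int touched = 0;
        int i = 0;
        while (i < row.size()) {
            Cell c = row.get(i);
            touched++;
            if (c.deleteMarker) {
                String target = requested.higher(c.col);  // the (i + 1)th requested column
                if (target == null) break;                // nothing left to read in this row
                i = firstIndexAtOrAfter(row, target);
                continue;
            }
            if (requested.contains(c.col)) out.add(c);
            i++;
        }
        return touched;
    }

    // Stand-in for a seek; a linear search keeps the sketch short.
    static int firstIndexAtOrAfter(List<Cell> row, String col) {
        int i = 0;
        while (i < row.size() && row.get(i).col.compareTo(col) < 0) i++;
        return i;
    }

    public static void main(String[] args) {
        List<Cell> row = Arrays.asList(new Cell("C1", 9, true), new Cell("C1", 2, false),
                new Cell("C1", 1, false), new Cell("C3", 1, false));
        TreeSet<String> requested = new TreeSet<>(Arrays.asList("C1", "C3"));
        System.out.println(scanWithNext(row, requested, new ArrayList<>()) + " vs "
                + scanWithSeek(row, requested, new ArrayList<>()));
    }
}
```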

25 Data block read requests (dark launch)
Thu, Sep 15 – Sun, Sep
Pushed on Tue Sep 20th:
▪ No extra next when done with column/row (HBASE-4433)
▪ No KV prefetch (HBASE-4434)
▪ Lazy Seek (HBASE-4465)
Fri Sep 16th vs. Sep 23rd: 45% savings in logical block read requests (cache hits + misses)

26 Data block read requests (dark launch)
Sun, Sep 25 – Mon, Oct
Pushed on Fri Sep 30th:
▪ Avoid top-of-the-row seek (HBASE-4469, Liyin)
▪ Off-peak compactions (HBASE-4463, Karthik)
Sun Sep 25th vs. Oct 2nd: 33% savings in logical block read requests (cache hits + misses)

27 Data block cache misses (dark launch)
▪ 20.6 K (Mon Sep 19th) -> 11.8 K (Mon Sep 26th) -> 9.8 K (Mon Oct 3rd)
▪ 52% savings (42% and then 17% more)
Chart annotations: No next KV prefetch; No next() when done with row/column; Lazy Seek; No top-of-the-row seek; Off-peak compactions
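The percentages on this slide follow from the three cache-miss counts. A small sketch of the arithmetic, using a hypothetical `savingsPercent` helper: the first step works out to about 42.7% (shown as 42% on the slide), the second to about 16.9%, and the total to about 52.4%.

```java
class SavingsSketch {
    // Relative reduction from "before" to "after", in percent.
    static double savingsPercent(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Cache-miss counts from the slide, in thousands.
        double sep19 = 20.6, sep26 = 11.8, oct3 = 9.8;
        System.out.printf("first push:  %.1f%%%n", savingsPercent(sep19, sep26));
        System.out.printf("second push: %.1f%%%n", savingsPercent(sep26, oct3));
        System.out.printf("total:       %.1f%%%n", savingsPercent(sep19, oct3));
    }
}
```

The two steps compound rather than add: the second saving is measured against the already reduced count of 11.8 K.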

28 Avoid loading previous block (HBASE-4443)
We sometimes go to previous block on exact match
▪ Future work
▪ Suppose the first key of a block matches (Row, Column)
▪ But maybe there is an earlier key that would also match?
▪ We load the previous block to find out
▪ Possible fixes:
▪ Track deletes and optimize the MAX_VERSIONS=1 case
▪ Add last key in block to index (increases index size)

29 Top-of-the-column seek (HBASE-4962)
Some applications do not use DeleteColumn
▪ Future work
▪ DeleteColumn deletes all versions of a particular column
▪ Comes before all Puts for a (Row, Column)
▪ Slows down timestamp range queries
▪ Proposed solution:
▪ Add a (Row, Column) Bloom filter for DeleteColumn only
▪ Seek to (Row, Column, T2) for a [T1, T2] range query
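A minimal sketch of the proposed seek for a [T1, T2] range query follows. It assumes there is no DeleteColumn marker for the (Row, Column), which is what the proposed Bloom filter would establish; the `KV` class and `rangeQuery` method are hypothetical. Because versions are sorted newest first, seeking straight to (Row, Column, T2) skips every version newer than T2.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class TimeRangeSeekSketch {
    static final class KV {
        final String row, col;
        final long ts;
        KV(String row, String col, long ts) { this.row = row; this.col = col; this.ts = ts; }
    }

    // Rows and columns ascend; timestamps descend, so newer versions sort first.
    static final Comparator<KV> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c == 0) c = a.col.compareTo(b.col);
        if (c == 0) c = Long.compare(b.ts, a.ts);
        return c;
    };

    // Timestamps of (row, col) within [t1, t2], newest first.
    // Assumes no DeleteColumn marker exists for this (row, col).
    static List<Long> rangeQuery(TreeSet<KV> file, String row, String col, long t1, long t2) {
        List<Long> out = new ArrayList<>();
        // Start at (row, col, t2) instead of the top of the column.
        for (KV kv : file.tailSet(new KV(row, col, t2), true)) {
            if (!kv.row.equals(row) || !kv.col.equals(col) || kv.ts < t1) break;
            out.add(kv.ts);
        }
        return out;
    }

    public static void main(String[] args) {
        TreeSet<KV> file = new TreeSet<>(ORDER);
        for (long ts = 1; ts <= 10; ts++) file.add(new KV("R1", "C1", ts));
        System.out.println(rangeQuery(file, "R1", "C1", 3, 6));
    }
}
```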

30 (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0

