Optimizing HBase scanner performance

Mikhail Bautin, Software Engineer, 01/19/2012

HBase Scanners: what happens on a Get
[Diagram: a RegionScanner sits on top of one StoreScanner per column family (ColumnFamily1, ColumnFamily2, ...), where Store = (Region, CF). Each StoreScanner reads from several StoreFileScanners. KeyValues shown in the store files: (R1,C1,T3), (R1,C2,T2), (R1,C2,T1), (R1,C1,T1), (R1,C2,T3), (R2,C1,T2), (R2,C2,T1), ...]

HBase Scanner State: what happens on a next()
[Diagram: the RegionScanner keeps a priority queue of StoreScanners (ColumnFamily1, ColumnFamily2, ...; Store = (Region, CF)). Each StoreScanner keeps its own priority queue of StoreFileScanners, and each StoreFileScanner holds its current KeyValue.]
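The priority-queue arrangement above can be sketched as a k-way merge. This is a minimal illustration, not HBase code: the class FileScanner, the function merged_scan, and the use of integer timestamps are assumptions made for the example. Keys are ordered by row, then column, then newest timestamp first.

```python
import heapq

def sort_key(kv):
    """Order KeyValues by row, then column, then newest timestamp first."""
    row, col, ts = kv
    return (row, col, -ts)

class FileScanner:
    """Hypothetical stand-in for a StoreFileScanner over one sorted file."""
    def __init__(self, kvs):
        self._kvs = sorted(kvs, key=sort_key)
        self._pos = 0

    def peek(self):
        """Return the current KeyValue without advancing, or None at EOF."""
        return self._kvs[self._pos] if self._pos < len(self._kvs) else None

    def next(self):
        """Return the current KeyValue and advance."""
        kv = self.peek()
        if kv is not None:
            self._pos += 1
        return kv

def merged_scan(scanners):
    """Yield KeyValues from all scanners in global sorted order."""
    heap = []
    for i, scanner in enumerate(scanners):
        kv = scanner.peek()
        if kv is not None:
            heapq.heappush(heap, (sort_key(kv), i))
    while heap:
        _, i = heapq.heappop(heap)      # scanner with the smallest current key
        scanner = scanners[i]
        yield scanner.next()
        following = scanner.peek()
        if following is not None:       # re-queue under its new current key
            heapq.heappush(heap, (sort_key(following), i))
```

Each pop costs one next() on the winning scanner, which is why the optimizations in the following slides concentrate on avoiding unnecessary next() and seek calls.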

Avoiding next() on StoreFileScanner
Every next() call may result in disk I/O
HBASE-4433: avoid extra next if done with row/column (Kannan)
  An optimization for queries specifying a column set
  INCLUDE_AND_SEEK_NEXT_COL
  INCLUDE_AND_SEEK_NEXT_ROW
HBASE-4434: Don't do HFile Scanner next() unless the next KV is needed (Kannan)
  Avoid aggressive pre-fetching
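One way to picture the two match codes is a matcher that, once it has included the last version it needs for a column, tells the caller where to seek instead of letting it call next() again. The sketch below is a simplification under stated assumptions: the function match and its parameters are hypothetical, and only the names INCLUDE_AND_SEEK_NEXT_COL and INCLUDE_AND_SEEK_NEXT_ROW come from the slide.

```python
from enum import Enum, auto

class MatchCode(Enum):
    SKIP = auto()
    INCLUDE = auto()
    INCLUDE_AND_SEEK_NEXT_COL = auto()
    INCLUDE_AND_SEEK_NEXT_ROW = auto()

def match(column, versions_already_included, requested_columns, max_versions=1):
    """Decide what to do with a KeyValue of `column` (hypothetical helper).

    requested_columns must be sorted; versions_already_included counts the
    versions of this column returned so far in the current row.
    """
    if column not in requested_columns:
        return MatchCode.SKIP
    if versions_already_included + 1 < max_versions:
        return MatchCode.INCLUDE          # more versions of this column wanted
    if column == requested_columns[-1]:
        # Last requested column is complete: nothing more to read in this row.
        return MatchCode.INCLUDE_AND_SEEK_NEXT_ROW
    # This column is complete: jump straight to the next requested column.
    return MatchCode.INCLUDE_AND_SEEK_NEXT_COL
```

With a plain INCLUDE the caller would have to issue another next() just to discover that the column or row is finished; the combined codes carry that information along with the included KeyValue.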

Simple ROWCOL Bloom Filters
Do we have to read all of these files?
Query: (R1, C3)
[Diagram: three store files, each a table with columns Row, Col, TS. Cells as extracted, file by file: R1 C1 T2 T1 C2 R2 C3 | R1 C1 T3 C2 C3 T2 R2 | R1 C1 T4 C2 T2 R2 T1]

Simple ROWCOL Bloom Filters
In some cases, we only have to read one file
Query: (R1, C3)
[Diagram: the same three store files (columns Row, Col, TS).]
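The file-skipping idea can be sketched with a toy Bloom filter keyed on (row, column). Everything here is an assumption made for illustration: the class BloomFilter, the helpers rowcol_key, build_filter and files_to_read, the filter size, and the use of SHA-256 are not taken from HBase.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter; may report false positives, never false negatives."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

def rowcol_key(row, col):
    """Combine row and column into one filter key (separator is an assumption)."""
    return f"{row}\x00{col}".encode()

def build_filter(rowcols):
    """Build a ROWCOL-style filter from the (row, column) pairs in one file."""
    bloom = BloomFilter()
    for row, col in rowcols:
        bloom.add(rowcol_key(row, col))
    return bloom

def files_to_read(filters, row, col):
    """Return the names of files whose filter cannot rule out (row, col)."""
    key = rowcol_key(row, col)
    return [name for name, bloom in filters.items() if bloom.might_contain(key)]
```

A file is skipped only when its filter says the pair is definitely absent. A false positive costs one unnecessary read, but a file that really contains the pair is never skipped.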

Multi-column Bloom Filters (HBASE-2794)
ROWCOL Bloom filters for multi-column queries
Query: C1 and C3 in all rows
[Diagram, repeated on each of the following slides: the same three store files (columns Row, Col, TS).]

Query: C1 and C3 in all rows, seek to (R1, C1)

Query: C1 and C3 in all rows, seek to (R1, C3)
  Fake key: (R1, end of C3)
  Fake key: (R1, end of C3)

Query: C1 and C3 in all rows, seek to (R2, C1)
  (R2, C1, T1)
  (R2, C1, T2) wins by timestamp
  (R2, C1, T1)

Query: C1 and C3 in all rows, seek to (R2, C3)
  Fake key: (R2, end of C3)
  Fake key: (R2, end of C3)
  (R2, C3, T1)
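The "fake key" annotations can be read as follows: when a file's ROWCOL filter rules out the requested (row, column), the scanner is parked on a key that sorts after every version of that column instead of performing a real seek. The sketch below assumes that reading. The class BloomAwareScanner is hypothetical, an exact Python set stands in for the Bloom filter (so there are no false positives), and timestamps are plain numbers.

```python
import bisect

LATEST_TS = float("inf")    # sorts before every real version of a column
OLDEST_TS = float("-inf")   # sorts after every real version: "end of column"

def sort_key(kv):
    """Order by row, then column, then newest timestamp first."""
    row, col, ts = kv
    return (row, col, -ts)

class BloomAwareScanner:
    """Hypothetical scanner that skips real seeks the filter rules out."""
    def __init__(self, kvs):
        self.kvs = sorted(kvs, key=sort_key)
        self.keys = [sort_key(kv) for kv in self.kvs]
        # Stand-in for a ROWCOL Bloom filter (assumption: no false positives).
        self.rowcols = {(row, col) for row, col, _ in kvs}
        self.current = None
        self.real_seeks = 0

    def seek(self, row, col):
        if (row, col) not in self.rowcols:
            # Filter says "absent": park on a fake key, touch no data blocks.
            self.current = (row, col, OLDEST_TS)
            return
        self.real_seeks += 1
        i = bisect.bisect_left(self.keys, sort_key((row, col, LATEST_TS)))
        self.current = self.kvs[i]
```

Because the fake key sorts after all real versions of the column, any scanner holding a real KeyValue for that column wins in the priority queue, and the parked scanner is consulted again only when the scan moves past that column.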

Lazy Seek (HBASE-4465)
Optimizing for reading recent data
[Diagram, repeated on each slide: the same three store files (columns Row, Col, TS), with timestamp ranges T1 – T2, T2 – T3 and T1 – T4.]

Scanner states, slide by slide:
  1. Fake key: (R1, C1, T4) | Fake key: (R1, C1, T3) | Fake key: (R1, C1, T2)
  2. (R1, C1, T4) | Fake key: (R1, C1, T3) | Fake key: (R1, C1, T2)
  3. (R2, C1, T1) | Fake key: (R1, C3, T3) | Fake key: (R1, C3, T2)
  4. (R2, C1, T1) | (R1, C3, T2) is next | Fake key: (R1, C3, T2)
  5. (R2, C1, T1) | Fake key: (R2, C1, T3), to be selected next | Fake key: (R2, C1, T2)
  6. (R2, C1, T1) | (R2, C1, T2) wins by timestamp | Fake key: (R2, C1, T2)
  7. EOF | Real seek to (R2, C3, T3) | Fake key: (R2, C3, T2)
  8. EOF | EOF | (R2, C3, T1)
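The walkthrough above suggests the following reading of lazy seek: each scanner first pretends to be at (row, column, newest timestamp in its file), and a real seek happens only when that fake key reaches the top of the priority queue. The sketch below follows that reading. The class LazyScanner, the function latest_version, and the file contents used in the usage note are assumptions made for illustration.

```python
import bisect
import heapq

def sort_key(kv):
    """Order by row, then column, then newest timestamp first."""
    row, col, ts = kv
    return (row, col, -ts)

class LazyScanner:
    """Hypothetical store file scanner that defers real seeks."""
    def __init__(self, kvs):
        self.kvs = sorted(kvs, key=sort_key)
        self.keys = [sort_key(kv) for kv in self.kvs]
        self.max_ts = max(ts for _, _, ts in kvs)   # newest timestamp in file
        self.current = None
        self.is_fake = False
        self.real_seeks = 0

    def request_seek(self, row, col):
        """Lazy seek: nothing in this file can be newer than max_ts."""
        self.current = (row, col, self.max_ts)
        self.is_fake = True

    def enforce_seek(self):
        """Real seek to the fake key; this is the step that may cost I/O."""
        self.real_seeks += 1
        i = bisect.bisect_left(self.keys, sort_key(self.current))
        self.current = self.kvs[i] if i < len(self.kvs) else None   # None = EOF
        self.is_fake = False

def latest_version(scanners, row, col):
    """Return the newest KeyValue for (row, col), or None if there is none."""
    for scanner in scanners:
        scanner.request_seek(row, col)
    heap = [(sort_key(s.current), i) for i, s in enumerate(scanners)]
    heapq.heapify(heap)
    while heap:
        _, i = heapq.heappop(heap)
        scanner = scanners[i]
        if scanner.is_fake:
            scanner.enforce_seek()          # only now is the file touched
            if scanner.current is not None:
                heapq.heappush(heap, (sort_key(scanner.current), i))
            continue
        kv = scanner.current
        return kv if kv[:2] == (row, col) else None
    return None
```

With hypothetical files covering timestamps 1 to 2, 2 to 3 and 1 to 4, a query for the latest (R1, C1) performs one real seek, in the file whose newest timestamp is 4. The other two scanners keep their fake keys and are never read, which is the saving for workloads that mostly read recent data.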

Top-of-the-row seek
Some applications do not use DeleteFamily
We always seek to the top of the row first
  DeleteFamily comes before all columns, i.e. at (R1, empty column)
  Even if we only need (R1, C1), there might be a DeleteFamily for R1
Some applications do not even use DeleteFamily
Two fixes by Liyin Tang:
  Utilize existing ROWCOL Bloom filter (HBASE-4469)
  Added a separate ROW-only Bloom filter for DeleteFamily (HBASE-4532)

Seek on deleted KV (HBASE-4585)
What if the requested column has been deleted?
We are requesting C1, C2, ..., Cn
What if we see a delete marker for Ci?
Previously, we would keep calling next()
Now, we seek to the (i + 1)'th requested column (also a fix by Liyin)
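The difference between the old and new behaviour can be sketched by counting operations within a single row. This is a simplified model, not HBase code: the function scan_row, the cell layout (column, kind, value), and the omission of timestamps are assumptions for the example. The delete marker is taken to sort before the Puts of its column.

```python
import bisect

def scan_row(cells, requested, seek_on_delete):
    """Scan one row and count next() calls and seeks (hypothetical helper).

    cells: list of (column, kind, value) sorted by column, with a
    'DeleteColumn' marker placed before the 'Put' cells of its column.
    requested: sorted list of requested columns.
    """
    columns = [cell[0] for cell in cells]
    results, next_calls, seeks = [], 0, 0
    deleted = set()
    i = 0
    while i < len(cells):
        col, kind, value = cells[i]
        if kind == "DeleteColumn" and seek_on_delete:
            later = [c for c in requested if c > col]
            if not later:
                break                       # no requested column left in this row
            # New behaviour: one seek to the next requested column.
            i = bisect.bisect_left(columns, later[0], i + 1)
            seeks += 1
            continue
        if kind == "DeleteColumn":
            deleted.add(col)                # old behaviour: remember and step on
        elif col in requested and col not in deleted:
            results.append((col, value))
        i += 1
        next_calls += 1
    return results, next_calls, seeks
```

For a deleted column with many old versions, the old path steps over every one of them with next(), while the new path replaces that walk with a single seek.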

Data block read requests (dark launch)
[Chart: Thu, Sep 15 – Sun, Sep]
Fri Sep 16th vs. Sep 23rd: 45% savings in logical block read requests (cache hits + misses)
Pushed on Tue Sep 20th:
  No extra next when done with column/row (HBASE-4433)
  No KV prefetch (HBASE-4434)
  Lazy Seek (HBASE-4465)

Data block read requests (dark launch)
[Chart: Sun, Sep 25 – Mon, Oct]
Sun Sep 25th vs. Oct 2nd: 33% savings in logical block read requests (cache hits + misses)
Pushed on Fri Sep 30th:
  Avoid top-of-the-row seek (HBASE-4469, Liyin)
  Off-peak compactions (HBASE-4463, Karthik)

Data block cache misses (dark launch)
20.6 K (Mon Sep 19th) -> 11.8 K (Mon Sep 26th) -> 9.8 K (Mon Oct 3rd)
52% savings (42% and then 17% more)
  No next KV prefetch
  No next() when done with row/column
  Lazy Seek
  No top-of-the-row seek
  Off-peak compactions
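The percentages can be checked from the three rounded figures on the slide. The sketch below only does that arithmetic; the helper savings is a name introduced for the example. From the rounded values the first step comes out at about 42.7%, so the 42% on the slide was presumably computed from unrounded counts.

```python
def savings(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before

misses = [20.6, 11.8, 9.8]   # thousands of cache misses, as read off the slide

first = savings(misses[0], misses[1])    # Sep 19th -> Sep 26th
second = savings(misses[1], misses[2])   # Sep 26th -> Oct 3rd
total = savings(misses[0], misses[2])    # Sep 19th -> Oct 3rd

# Successive savings compound multiplicatively rather than adding up.
combined = 1 - (1 - first) * (1 - second)

print(f"first: {first:.1%}, second: {second:.1%}, total: {total:.1%}")
```

The two steps do not sum to the total (42% + 17% is not 52%) because the second saving applies to the already reduced baseline.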

Avoid loading previous block (HBASE-4443)
We sometimes go to previous block on exact match (future work)
Suppose the first key of a block matches (Row, Column)
But maybe there is an earlier key that would also match?
We load the previous block to find out
Possible fixes:
  Track deletes and optimize the MAX_VERSIONS=1 case
  Add last key in block to index (increases index size)

Top-of-the-column seek (HBASE-4962)
Some applications do not use DeleteColumn (future work)
DeleteColumn deletes all versions of a particular column
  Comes before all Puts for a (Row, Column)
  Slows down timestamp range queries
Proposed solution:
  Add a (Row, Column) Bloom filter for DeleteColumn only
  Seek to (Row, Column, T2) for a [T1, T2] range query
