Presentation is loading. Please wait.

Presentation is loading. Please wait.

Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split Wook-Hee Kim 1, Beomseok Nam 1, Dongil Park 2, Youjip Won.

Similar presentations


Presentation on theme: "Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split Wook-Hee Kim 1, Beomseok Nam 1, Dongil Park 2, Youjip Won."— Presentation transcript:

1 Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split Wook-Hee Kim 1, Beomseok Nam 1, Dongil Park 2, Youjip Won 2 1 Ulsan National Institute of Science and Technology, Korea 2 Hanyang University, Korea LG 전자 세미나 Mar 18, 2014 Paper presented in USENIX FAST ’14, Santa Clara, CA

2 USENIX FAST ‘14, Santa Clara, CA Outline  Motivation Journaling of Journal Anomaly  How to resolve Journaling of Journal anomaly Multi-Version B-Tree (MVBT)  Optimizations of Multi-Version B-Tree Lazy Split Reserved Buffer Space Lazy Garbage Collection Metadata Embedding Disabling Sibling Redistribution  Evaluation  Demo 2

3 USENIX FAST ‘14, Santa Clara, CA Android is the most popular mobile platform 3

4 USENIX FAST ‘14, Santa Clara, CA Storage in Android platform Performance Bottleneck 4 1 0 0 1 0 1 01 0 1 0 0 1 1 1 1 0 0 0 1 0 Major cause for device break down. Cause = Excessive IO

5 USENIX FAST ‘14, Santa Clara, CA Android I/O Stack EXT4 Insert/Update/ Delete/Select Read/Write 5 SQLite Misaligned Interaction Block Device Driver Read/Write

6 USENIX FAST ‘14, Santa Clara, CA SQLite insert (Truncate mode) SQLite insert Open rollback journal. Record the data to journal. Put commit mark to journal. Insert entry to DB Truncate journal. 6

7 USENIX FAST ‘14, Santa Clara, CA EXT4 Journaling (ordered mode) EXT4 SQLite Write data block Write EXT4 journal Write journal metadata Write journal commit 7 write()

8 USENIX FAST ‘14, Santa Clara, CA Journaling of Journal Anomaly EXT4 SQLite insert 8 One insert of 100 Byte  9 Random Writes of 4KByte Write data block Write EXT4 journal Write journal metadata Write journal commit

9 USENIX FAST ‘14, Santa Clara, CA How to Resolve Journaling of Journal? EXT4 SQLite insert Database Journaling Database Insertion 9

10 USENIX FAST ‘14, Santa Clara, CA How to Resolve Journaling of Journal? fsync() Database Journaling Database Insertion Journaling + Database = Version-based B-Tree (MVBT) 10 Reducing the number of fsync() calls

11 USENIX FAST ‘14, Santa Clara, CA Version-based B-Tree (MVBT) 10 [T1, ∞) T1 Insert 10Update 10 with 20 T2 10 [T1, T2) 20 [T2, ∞) Time Do not overwrite old version data Do not need a rollback journal 11 10 [T1, T2)

12 USENIX FAST ‘14, Santa Clara, CA Node Split in Multi-Version B-Tree 12 5 [3~∞ ) 10 [2~∞ ) 12 [4~∞ ) 40 [1~∞ ) Page 1 Key = 25 = 5

13 USENIX FAST ‘14, Santa Clara, CA Node Split in Multi-Version B-Tree 13 5 [3~5) 10 [2~5) 12 [4~5) 40 [1~5) Page 1 Dead Node Key = 25 = 5

14 USENIX FAST ‘14, Santa Clara, CA Node Split in Multi-Version B-Tree 14 5 [3~5) 10 [2~5) 12 [4~5) 40 [1~5) Page 1 Key = 25 = 5

15 USENIX FAST ‘14, Santa Clara, CA Node Split in Multi-Version B-Tree 15 5 [5~ ∞ ) 10 [5~ ∞ ) Page 3 12 [5~ ∞ ) 40 [5~ ∞ ) Page 2 5 [3~5) 10 [2~5) 12 [4~5) 40 [1~5) Page 1 Key = 25 = 5

16 USENIX FAST ‘14, Santa Clara, CA Node Split in Multi-Version B-Tree 16 5 [3~5) 10 [2~5) 12 [4~5) 40 [1~5) Page 1 5 [5~ ∞ ) 10 [5~ ∞ ) Page 3 12 [5~ ∞ ) 40 [5~ ∞ ) Page 2 25 [5~ ∞ ) 10 [5~ ∞ ) : Page 3 ∞ [5~ ∞ ) : Page 2 Page 4 ∞ [0~5) : Page 1

17 USENIX FAST ‘14, Santa Clara, CA 5 [5~ ∞ ) 10 [5~ ∞ ) Page 3 Node Split in Multi-Version B-Tree 17 5 [3~5) 10 [2~5) 12 [4~5) 40 [1~5) Page 1 12 [5~ ∞ ) 40 [5~ ∞ ) Page 2 10 [5~ ∞ ) : Page 3 ∞ [5~ ∞ ) : Page 2 Page 4 25 [5~ ∞ ) dirty One more dirty page than original B-Tree ∞ [0~5) : Page 1

18 USENIX FAST ‘14, Santa Clara, CA I/O Traffic in MVBT DB buffer cache fsync() 18 Reduce the number of dirty pages! MVBT

19 USENIX FAST ‘14, Santa Clara, CA I/O Traffic in MVBT MVBT LS -MVBT DB buffer cache fsync() DB buffer cache fsync() 19

20 USENIX FAST ‘14, Santa Clara, CA Optimizations in Android I/O Write Read t write t read >> Write Transaction 1 Write Transaction 1 Write Transaction 2 Write Transaction 2 Write Transaction 3 Write Transaction 3 20 Single Insert Transaction Multiple Insert

21 USENIX FAST ‘14, Santa Clara, CA Optimizations LS-MVBT Lazy Split Multi-Version B-Tree (LS-MVBT) Lazy Split Disabling Sibling Redistribution Metadata Embedding Lazy Garbage Collection Reserved Buffer Space 21

22 USENIX FAST ‘14, Santa Clara, CA  Legacy Split in MVBT Optimization1: Lazy Split 22 5 [3~5) 10 [2~5) 12 [4~5) 40 [1~5) Page 1 5 [5~ ∞ ) 10 [5~ ∞ ) Page 3 12 [5~ ∞ ) 40 [5~ ∞ ) Page 2 10 [5~ ∞ ) : Page 3 ∞ [5~ ∞ ) : Page 2 Page 4 25 [5~ ∞ ) dirty ∞ [0~5) : Page 1 4 dirty pages

23 USENIX FAST ‘14, Santa Clara, CA Optimization1: Lazy Split 23 5 [3~∞ ) 10 [2~∞ ) 12 [4~∞ ) 40 [1~∞ ) Page 1 Key = 25 = 5

24 USENIX FAST ‘14, Santa Clara, CA Optimization1: Lazy Split 24 5 [3~∞ ) 10 [2~∞ ) 12 [4~5 ) 40 [1~5 ) Page 1 Lazy Node: Half-dead, Half-live Lazy Node: Half-dead, Half-live Key = 25 = 5

25 USENIX FAST ‘14, Santa Clara, CA Optimization1: Lazy Split 25 5 [3~∞ ) 10 [2~∞ ) 12 [4~5 ) 40 [1~5 ) Page 1 Key = 25 = 5

26 USENIX FAST ‘14, Santa Clara, CA Optimization1: Lazy Split 26 12 [5~∞ ) 40 [5~∞ ) Page 2 5 [3~∞ ) 10 [2~∞ ) Page 1 12 [4~5 ) 40 [1~5 ) Key = 25 = 5

27 USENIX FAST ‘14, Santa Clara, CA Optimization1: Lazy Split 27 12 [5~∞ ) 40 [5~∞ ) Page 2 5 [3~∞ ) 10 [2~∞ ) 12 [4~5 ) 40 [1~5 ) Page 1 10 [5~∞ ) : Page 1 ∞ [0~5 ) : Page 1 ∞ [5 ~ ∞ ) : Page 2 Page 3 25 [5~∞ )

28 USENIX FAST ‘14, Santa Clara, CA Optimization1: Lazy Split  Lazy Node Overflow 28 Dead entries 12 [5~∞ ) 40 [5~∞ ) Page 2 5 [3~∞ ) 10 [2~∞ ) 12 [4~5 ) 40 [1~5 ) Page 1 10 [5~∞ ) : Page 1 ∞ [0~5 ) : Page 1 ∞ [5 ~ ∞ ) : Page 2 Page 3 25 [5~∞ ) Key = 8 = 6

29 USENIX FAST ‘14, Santa Clara, CA  Lazy Node Overflow → Garbage collect dead entries Optimization1: Lazy Split 29 12 [5~∞ ) 40 [5~∞ ) Page 2 5 [3~∞ ) 10 [2~∞ ) 12 [4~5 ) 40 [1~5 ) Page 1 10 [5~∞ ) : Page 1 ∞ [0~5 ) : Page 1 ∞ [5 ~ ∞ ) : Page 2 Page 3 25 [5~∞ ) Key = 8 = 6

30 USENIX FAST ‘14, Santa Clara, CA  Lazy Node Overflow → Garbage collect dead entries Optimization1: Lazy Split 30 5 [3~∞ ) 8 [6~∞ ) 10 [2~∞ ) Page 1 10 [5~∞ ) : Page 1 ∞ [0~5 ) : Page 1 ∞ [5 ~ ∞ ) : Page 2 Page 3 12 [5~∞ ) 40 [5~∞ ) Page 2 25 [5~∞ ) But, what if the dead entries are being accessed by other transactions? Key = 8 = 6

31 USENIX FAST ‘14, Santa Clara, CA Optimization2: Reserved Buffer Space  Option 1. Wait for read transactions to finish 31 12 [5~∞ ) 40 [5~∞ ) Page 2 5 [3~∞ ) 10 [2~∞ ) 12 [4~5 ) 40 [1~5 ) Page 1 10 [5~∞ ) : Page 1 ∞ [0~5 ) : Page 1 ∞ [5 ~ ∞ ) : Page 2 Page 3 25 [5~∞ ) Key = 8 = 6

32 USENIX FAST ‘14, Santa Clara, CA Optimization2: Reserved Buffer Space  Option 2. Split as in legacy MVBT split 32 10 [5~∞ ) : Page 1 ∞ [0~5 ) : Page 1 ∞ [5 ~ ∞ ) : Page 2 Page 3 12 [5~∞ ) 40 [5~∞ ) Page 2 25 [5~∞ ) Key = 8 = 6 5 [6~∞ ) 10 [6~∞ ) Page 4 5 [3~6) 1 0 [2~6) 12 [4~5 ) 40 [1~5 ) Page 1

33 USENIX FAST ‘14, Santa Clara, CA 5 [6~∞ ) 8 [6~∞ ) Page 4 Optimization2: Reserved Buffer Space  Option 2. Split as in legacy MVBT split 33 5 [3 ~6) 10 [2 ~6) 12 [4~5 ) 40 [1~5 ) Page 1 10 [5~6 ) : Page 1 ∞ [0~5 ) : Page 1 ∞ [5 ~ ∞ ) : Page 2 Page 3 12 [5~∞ ) 40 [5~∞ ) Page 2 25 [5~∞ ) 10 [6 ~ ∞ ) : Page 3 10 [6~∞ )

34 USENIX FAST ‘14, Santa Clara, CA  Option 3. Pad some buffer space in tree nodes Optimization2: Reserved Buffer Space Buffer space is used when lazy node is full 34 40 [5~∞ ) Page 2 10 [2~∞ ) 12 [4~5 ) 40 [1~5 ) Page 1 10 [5~∞ ) : Page 1 ∞ [0~5 ) : Page 1 ∞ [5 ~ ∞ ) : Page 2 Page 3 25 [5~∞ ) 12 [5~∞ ) 8 [6~ ∞ ) If buffer space is also full, split as in legacy MVBT Key = 8 = 6

35 USENIX FAST ‘14, Santa Clara, CA Similar to rollback in MVBT Rollback in LS-MVBT 35 Txn 5 crashes 12 [5~∞ ) 40 [5~∞ ) Page 2 25 [5~∞ ) Rollback Number of dirty nodes touched by rollback of LS- MVBT is also smaller than that of MVBT. 10 [5~∞ ) : Page 1 ∞ [0~5 ) : Page 1 ∞ [5 ~ ∞ ) : Page 2 Page 3 10 [5~∞ ) : Page 1 ∞ [0~5 ) : Page 1 ∞ [5 ~ ∞ ) : Page 2 Page 3 5 [3~∞ ) 10 [2~∞ ) 12 [4~5 ) 40 [1~5 ) Page 1 5 [3~∞ ) 10 [2~∞ ) 12 [4~5 ) 40 [1~5 ) Page 1 5 [3~∞ ) 10 [2~∞ ) 12 [4~5 ) 40 [1~5 ) Page 1 ∞ [0~5 ) : Page 1 Page 3 5 [3~∞ ) 10 [2~∞ ) 12 [4~∞ ) 40 [1~ ∞ ) Page 1 ∞ [0~∞ ) : Page 1 Page 3

36 USENIX FAST ‘14, Santa Clara, CA Optimization3: Lazy Garbage Collection Periodic Garbage Collection 36 P1 P2 P3 P1 Transaction buffer cache Transaction 1

37 USENIX FAST ‘14, Santa Clara, CA Optimization3: Lazy Garbage Collection Periodic Garbage Collection 37 Garbage Collection P1 GC buffer cache Transaction 1 P2 P3 Stopped P1 P2 P3 fsync() Extra Dirty Pages Transaction buffer cache

38 USENIX FAST ‘14, Santa Clara, CA Optimization3: Lazy Garbage Collection Lazy Garbage Collection 38 P2 P3 Transaction buffer cache Transaction 1 fsync() P1 Do not garabge collect if no space is needed Insert No Extra Dirty Page by GC

39 USENIX FAST ‘14, Santa Clara, CA Version = “File Change Counter” in DB header page 25 [5~∞) 40 [1~∞) 15 [6~∞) Optimization4: Metadata embedding 5 Header Page 1 Page 2... Write Transaction Write Transaction 6 Read Transaction Read Transaction 5 6 Commit dirty 39 Page N ++ 2 dirty pages

40 USENIX FAST ‘14, Santa Clara, CA RAMDISK Optimization4: Metadata embedding Flush “File Change Counter” to the last modified page and RAMDISK Write Transaction Write Transaction 6 Read Transaction Read Transaction 5 Commit 5 40 25 [5~∞) 40 [1~∞) 15 [6~∞) Header Page 1 Page 2... Page N 5 5 6 6

41 USENIX FAST ‘14, Santa Clara, CA Optimization4: Metadata embedding Flush “File Change Counter” to the last modified page and RAMDISK RAMDISK Volatile 6 6 41 Write Transaction Write Transaction 6 Read Transaction Read Transaction 5 Commit 25 [5~∞) 40 [1~∞) 15 [6~∞) Header Page 1 Page 2... Page N 6 dirty 5 1 dirty page CRASH

42 USENIX FAST ‘14, Santa Clara, CA Optimization5: Disable Sibling Redistribution  Sibling redistribution hurts insertion performance Disable sibling redistribution → avoid dirtying 4 pages Page 4: 5 10 Page 3: Page 2: 55 ∞ Insertion of key 20 12 25 15 40 20 dirty 4 dirty pages 42 dirt y Page 5: 10 : Page 4 40: Page 3 ∞ : Page 2 25 12 Page 5: 10 : Page 4 40: Page 3 ∞ : Page 2 40 10 dirty

43 USENIX FAST ‘14, Santa Clara, CA Page 4 5 10 Optimization5: Disable Sibling Redistribution  Disabled sibling redistribution 43 10 : Page 4 55 90 Insertion of key 20 12 25 15 40 40: Page 3 90 : Page 2 Page 5: 20 15 : Page 5 dirty 3 dirty pages Page 5 : dirty Page 2 Page 3: dirty

44 USENIX FAST ‘14, Santa Clara, CA Summary Optimizations 44 Lazy Split Disabling Sibling Redistribution Metadata Embedding Lazy Garbage Collection Reserved Buffer Space LS-MVBT Avoid dirtying an extra page when split occurs Reduce the probability of node split Delete dead entries on the next mutation Avoid dirtying header page Do not touch siblings to make search faster

45 USENIX FAST ‘14, Santa Clara, CA Performance Evaluation  Implemented LS-MVBT in SQLite 3.7.12  Testbed: Samsung Galaxy-S4 (JellyBean 4.3) Exynos 5 Octa 5410 Quad Core 1.6GHz 2GB Memory 16GB eMMC  Performance Analysis Tools Mobibench MOST 45

46 USENIX FAST ‘14, Santa Clara, CA LS-MVBT fsync WAL fsync() MVBT fsync() B-tree computation Analysis of insert Performance WAL Checkpointing Interval SQLite Insert (Response Time) LS-MVBT: 30~78% faster than WAL mode 46

47 USENIX FAST ‘14, Santa Clara, CA Analysis of insert Performance LS-MVBT flushes only one dirty page 47 SQLite Insert (# of Dirty Pages per Insert)

48 USENIX FAST ‘14, Santa Clara, CA Analysis of Block I/O Behavior 10 insertions (LS-MVBT)10 insertions (WAL ) 100 insertions (WAL)100 insertions (LS-MVBT) 20 block accesses 34 block accesses 148 block accesses 218 block accesses 48 SQLite Insert (# of Accesses to Block Device)

49 USENIX FAST ‘14, Santa Clara, CA I/O Traffic at Block Device Level 49 LS-MVBT saves 67% I/O traffic: 9.9 MB (LS-MVBT) vs 31 MB (WAL) x3 longer life time of NAND flash SQLite Insert (I/O Traffic per 10 msec) 1000 insertions (LS-MVBT)1000 insertions (WAL )

50 USENIX FAST ‘14, Santa Clara, CA  How about Search performance? LS-MVBT is optimized for write performance. 50

51 USENIX FAST ‘14, Santa Clara, CA Search Performance LS-MVBT wins unless more than 93% of transactions are search 51 Transaction Throughput with varying Search/Insert ratio

52 USENIX FAST ‘14, Santa Clara, CA How about recovery latency? WAL mode exhibits longer recovery latency due to log replay. How about LS-MVBT? 52

53 USENIX FAST ‘14, Santa Clara, CA Recovery LS-MVBT is x5~x6 faster than WAL. 53 Recovery Time with varying # of insert operations to rollback

54 USENIX FAST ‘14, Santa Clara, CA How much performance improvement for each optimization? 54

55 USENIX FAST ‘14, Santa Clara, CA Effect of Disabling Sibling Redistribution Disabling sibling redistribution ↓ 20% faster insertion 55 SQLite Insert: (Response Time and # of Dirty Pages)

56 USENIX FAST ‘14, Santa Clara, CA Performance Quantification 56 Throughput of SQLite: Combining All Optimizations 40% improvement 51% improvement 70% improvemnet X1.7 performance improvement against WAL mode LS-MVBT: 704 insertions/sec WAL : 417 insertions/sec 1220% improvemnet X13.2 performance improvement against Truncate mode LS-MVBT: 704 insertions/sec Truncate : 53 insertions/sec

57 USENIX FAST ‘14, Santa Clara, CA Query response time 57 Response time of MVBT is 80% of WAL Response time of LS-MVBT is 58% of WAL

58 USENIX FAST ‘14, Santa Clara, CA Conclusion  LS-MVBT resolves the Journaling of Journal anomaly via Reducing the number of fsync()’s with Version based B-tree. Reducing the overhead of single fsync() via lazy split, disabling sibling redistribution, metadata embedding, reserved buffer space, lazy garbage collection.  LS-MVBT achieves 70% performance gain against WAL mode (417 ins/sec  704 ins/sec) solely via software optimization. 58

59 USENIX FAST ‘14, Santa Clara, CA Thank you…


Download ppt "Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split Wook-Hee Kim 1, Beomseok Nam 1, Dongil Park 2, Youjip Won."

Similar presentations


Ads by Google