Database I/O Mechanisms


1 Database I/O Mechanisms
Performance and persistence
Richard Banville
Fellow, OpenEdge Development, Progress Software

2 Agenda
1. Database I/O Types
2. User Data I/O
3. Recovery Data I/O
4. Other I/O
5. Summary

3 File Write I/O for File Types
Logical vs. physical: a database request vs. an OS I/O
Physical I/O always uses the file system cache (no raw I/O)
Buffered vs. unbuffered I/O (a minimal sketch of the difference follows below):
- Unbuffered I/O is considered durable once the write system call returns
  - Recovery data with integrity
  - User data with -directio
- Buffered I/O requires a file system sync for durability
  - Recovery data with no integrity
  - User data
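As a rough illustration of that durability difference, here is a minimal POSIX C sketch (not OpenEdge source; the path handling and error treatment are simplified, and the `durable` flag is an invented parameter). A plain write() lands in the file system cache, while write() followed by fdatasync() is not considered complete until the data is on disk:

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical illustration of buffered vs. unbuffered-style writes.
 * Paths and parameters are placeholders, not OpenEdge internals. */
int write_block(const char *path, const void *buf, size_t len, int durable)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) return -1;

    /* Buffered I/O: the data lands in the file system cache; the OS
     * decides when it reaches the physical disk. */
    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }

    /* Unbuffered-style durability: force the cached data to disk.
     * Only after fdatasync() returns is the write crash-safe. */
    if (durable && fdatasync(fd) != 0) { close(fd); return -1; }

    return close(fd);
}
```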

4 OpenEdge I/O & The File System
[Diagram: the database buffer pool, BI buffers, and AI buffers in process shared memory write through the file system cache to the physical disk devices holding the .d, .b, and .a extents: a multi-level cache hierarchy.]

5 OpenEdge Data I/O & The File System
Data (.d) writes are buffered I/O to the file system cache:
- The file system decides when to write to the disk device
- The disk device decides when to write to the physical disk
At checkpoint, the data is made durable via fdatasync() / FlushFileBuffers()
- Required for crash recovery and for BI space reuse to work properly
In the Promon Checkpoints display (columns Flushes, Duration, Sync Time):
- Sync Time is the cost of the fdatasync() call
- Duration is the cost of the cluster close: 1) flush BI buffers, 2) scan the buffer pool (2a: flush buffers left from the previous checkpoint, 2b: put dirty buffers on the checkpoint queue), 3) flush AI buffers, 4) sync the file system cache with fdatasync()

6 OpenEdge -directio I/O & The File System
-directio is NOT what the name implies: it is unbuffered I/O through the file system cache, not raw I/O to the disk device. Data is written to the file system and then immediately synced, so each I/O is synced to the disk device (one way to get this behavior is sketched below).
Operational effects:
- No need to sync at checkpoint
- Write I/O is more expensive
- Additional cost to the page writers
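On POSIX systems, one common way to get that write-through behavior is the O_DSYNC open flag: each write() returns only after the data is durable, while still passing through the file system. Whether OpenEdge implements -directio exactly this way is an assumption made here purely for illustration:

```c
#include <fcntl.h>

/* Hypothetical illustration only: O_DSYNC gives per-write durability
 * through the file system cache, matching the described -directio
 * behavior (written to the F/S, then immediately synced to disk). */
int open_writethrough(const char *path)
{
    return open(path, O_WRONLY | O_DSYNC);
}
```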

7 OpenEdge -directio I/O Performance
How could the more expensive writes of -directio improve performance?
- APWs absorb the additional cost, if they do all the writing without adding OLTP contention
- Lower checkpoint costs: each I/O is already synced to the disk device, so no sync is needed during the checkpoint
- Higher throughput due to less pause
- May help on an inadequate file system
Less useful for:
- Well-tuned deployments
- Properly sized systems
- Workloads where buffers are flushed at checkpoint: any buffer flushed at checkpoint is MUCH more expensive under -directio!

8 OpenEdge Recovery I/O & The File System
Recovery (.bi) writes are unbuffered I/O to the file system cache: each I/O is synced to the disk device. For the .bi file this is called "reliable I/O".
BI blocks are written when:
- The BIW notices a full block in the output buffer
- An APW writes a data block with a BI dependency
- The broker notices an aged commit (-Mf)
- A user can't find an empty BI block to store update notes
- A user must perform a checkpoint

9 OpenEdge Recovery: Making it unreliable
Never in production; for specific maintenance (e.g., idxbuild) only.
-r: BI writes are buffered (un-reliable) to the file system
- All change notes are still recorded, so rollback will work and crash recovery is likely to work
- Recovery from an OS crash will most likely fail:
  *** An earlier -r session crashed, the database may be damaged. (514)

10 OpenEdge Recovery: Making it more unreliable
Never, ever in production; for specific maintenance only.
-i: no-integrity
- BI writes are buffered
- No data dependency check (the WAL rule is not enforced)
- No file system sync at checkpoint
- No record of purely physical notes (this is OK because rollback never occurs until after a micro-transaction completes)
- Rollback might work
- After an OS crash, DB crash, or abnormal termination, you must restore from a backup:
  ** Your database cannot be repaired You must restore a backup copy. (510)

11 Agenda
1. Database I/O Types
2. User Data I/O
3. Recovery Data I/O
4. Other I/O
5. Summary

12 … Buffer Pool I/O
Database buffer lookup:
- One buffer pool cache (-B, -B2) and one buffer pool hash table
- Multiple LRU replacement chains: an LRU buffer eviction policy for -B, an LRU2 policy for -B2
- If the block is not found via the hash table lookup, incur an O/S read I/O: a "page-in"
- But where do you read into?

13 … Buffer Pool I/O
Finding a buffer to read into (see the sketch below):
- Start at the LRU end of the buffer replacement chain
- Look for the first "non-dirty" buffer (to avoid a write)
- Can't find one after 10 tries? "Page-out" the least recently used buffer (an O/S write I/O)
  - These "LRU writes" may force (multiple) BI/AI writes, usually partial writes!
- "Page-in" your block to the available buffer (an O/S read I/O)
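A simplified C sketch of that eviction walk; the types and helpers (buf_t, flush_buffer) are invented for illustration, and the real engine is more involved:

```c
/* Hypothetical sketch of the buffer-replacement walk described above. */
typedef struct buf { struct buf *next_lru; int dirty; } buf_t;

extern void flush_buffer(buf_t *b);  /* page-out: an O/S write I/O */

buf_t *find_victim(buf_t *lru_tail)
{
    buf_t *b = lru_tail;

    /* Walk up to 10 buffers from the LRU end looking for a clean one,
     * to avoid doing a write just to satisfy a read. */
    for (int tries = 0; b != NULL && tries < 10; tries++) {
        if (!b->dirty)
            return b;                /* clean buffer: reuse immediately */
        b = b->next_lru;
    }

    /* No clean buffer within 10 tries: evict the least recently used
     * buffer.  This "LRU write" may first force BI/AI writes (WAL),
     * and those are usually partial block writes. */
    flush_buffer(lru_tail);
    return lru_tail;
}
```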

14 DB Data Read I/O Tuning
Increase performance by decreasing I/O. Avoiding read I/O:
- Use a large buffer pool (-B)
- Utilize the alternate buffer pool (-B2)
- Improve queries; avoid table scans; cache data locally
- Use private "read-only" buffers (-Bp), for utilities too!
Increase the pool when read I/O is unacceptable for a properly tuned application. Too many buffers may cause O/S paging, so decrease the file system cache, avoid non-essential activities on the production server, or consider buying more memory.
From the DBA side of things, it's always easy to increase -B; however, it will not improve performance for a poorly written application.

15 Data I/O Performance Monitoring - Promon
Promon R&D => Performance indicators and Promon R&D => Buffer cache report:
- O/S reads and O/S writes
- Flushed at checkpoint
- LRU writes
- APW enqueues
What about the buffer pool hit ratio % (BHR)?
- It is too easily skewed by bad queries, and it is not a fine enough metric (hits / requests)
- Example: 270,000 database read requests per second at a buffer hit ratio of 98% still means 5,400 O/S read I/Os per second (the arithmetic is spelled out below), and even fast file system access is still ~75x slower than -B access (stat from Tom Bascom's analysis of why buffer hit ratios are not effective)
- A low BHR indicates a poorly running system, but a high BHR does not denote a well-tuned system
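The miss arithmetic behind that example, written out with the hit ratio as a fraction:

```latex
\text{O/S reads per second} = \text{requests per second} \times (1 - \mathrm{BHR})
                            = 270{,}000 \times (1 - 0.98) = 5{,}400
```

Note that raising the BHR from 98% to 99% would halve the physical reads, which is why small differences near the top of the scale matter far more than they look.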

16 DB Data Write I/O Tuning
Increase performance by decreasing I/O. Avoiding write I/O:
- A large buffer pool lessens forced "page-outs"
- Improve queries in the application
- Reduce checkpoint frequency (see the next section)
- Run with APWs (have someone else do it!):
  - Avoids user and server writes
  - Decreases LRU writes (forced "page-outs")
  - Reduces checkpoint time
  - Performs DB buffer pool I/O and may flush AI and BI data

17 Asynchronous Page Writer Activities
APWs service both the primary -B buffer pool and the alternate -B2 buffer pool, writing from the checkpoint queue, the APW queue, and scans of the LRU chains.
- Checkpoint queue writes can cause forced BI writes (resulting in partial BI writes), subject to the WAL rule
- A forced BI write occurs only if the cluster is more than 95% full; at that point the write happens regardless, which can also cause partial writes
- This is why increasing the cluster size can reduce partial BI writes: the cluster is 95% full less of the time!
- A new adaptive mechanism for checkpoint processing avoids flushed buffers (10.2B FCS)

18 Asynchronous Page Writer Performance
Promon R&D => Page Writers reports:
- APW queue writes
- Checkpoint queue writes
- Buffers scanned and scan writes
Tuning (a conceptual sketch of the APW work loop follows below):
- Increase the number of APWs until 0 blocks are flushed at checkpoint; decrease it if partial BI writes increase
- Increasing the BI cluster size can avoid partial BI writes and forced BI writes (the cluster is 95% full less of the time)
- You typically need more APWs when running with -directio
- The adaptive mechanism changes the number of buffers written from the checkpoint queue based on the number of flushes in the previous checkpoint
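A conceptual C sketch of one pass of an APW's work as described on these two slides; every type, queue, and helper name here is invented, and the real scheduling is adaptive and more subtle:

```c
/* Hypothetical APW work loop, for illustration only. */
typedef struct buf buf_t;

extern buf_t *dequeue(void *queue);
extern void   write_data_block(buf_t *b);   /* may force BI writes (WAL) */
extern void   scan_and_write_some_dirty_buffers(void);
extern void  *checkpoint_queue, *apw_queue;

void apw_iteration(void)
{
    buf_t *b;

    /* 1. Drain buffers queued by the previous checkpoint first, so
     *    nothing is left to flush when the next checkpoint arrives. */
    while ((b = dequeue(checkpoint_queue)) != NULL)
        write_data_block(b);

    /* 2. Service the APW queue: dirty buffers pushed off the LRU end. */
    while ((b = dequeue(apw_queue)) != NULL)
        write_data_block(b);

    /* 3. Scan the buffer pool for dirty buffers ("scan writes").
     *    BI writes are forced from here only when the current BI
     *    cluster is > 95% full; larger clusters hit that less often. */
    scan_and_write_some_dirty_buffers();
}
```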

19 Agenda
1. Database I/O Types
2. User Data I/O
3. Recovery Data I/O
4. Other I/O
5. Summary

20 BI Buffer Pool
[Diagram: the BI buffer pool (-bibufs 10) with a free list, a modified queue, and a current output buffer for forward processing, plus a current input buffer and private backout buffers for rollback processing.]
For performance we buffer all I/O activity, and the BI is no different. Buffering BI activity maintains the durability rule as long as, when a transaction commits, all notes associated with that transaction are flushed to disk. Blocks on the free list are not formatted (NF); they are formatted in memory for disk addressability.

21 BI Buffer Pool – Recording a change
Forward processing: users append new notes (actions) to the current output buffer, and full buffers move to the modified queue for the BIW to write.
- Empty buffer waits: all buffers are full
- Busy buffer waits: BIB latch contention
- -bwdelay, in ms (30 ms): the BIW's nap time when nothing is dirty; it has little positive tuning effect

22 BI Buffer Pool – Forced Write I/O
Each dirty data block in the buffer pool carries an associated BI note dependency counter (based on fill %). Under the WAL rule, before an APW writes a data block, the BI notes that block depends on must already be on disk, which can force a BI write (see the sketch below).
[Diagram: data blocks on the checkpoint queue with BI dependency counters; an APW's data-block write forcing a BI buffer write.]
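A hedged C sketch of that WAL check; the fields and helpers (bi_written_through, bi_flush_through) are invented stand-ins for whatever the engine actually stores per block:

```c
/* Hypothetical WAL enforcement before a data-block write. */
typedef struct dblk {
    long bi_dependency;   /* highest BI note this block depends on */
} dblk_t;

extern long bi_written_through(void);     /* BI durable up to this note */
extern void bi_flush_through(long note);  /* force BI out to this note  */
extern void os_write_block(dblk_t *blk);

void write_data_block_wal(dblk_t *blk)
{
    /* Write-ahead logging: the BI notes a block depends on must be
     * durable before the block itself goes to disk.  This is how an
     * APW data write can force an (often partial) BI write. */
    if (blk->bi_dependency > bi_written_through())
        bi_flush_through(blk->bi_dependency);

    os_write_block(blk);
}
```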

23 BI Buffer Pool – Write I/O
Is it OK to buffer modified BI blocks? YES.
Is it OK to buffer committed BI data? Delayed commit (-Mf) is up to you!
Delayed commit (durability):
- Based on the -Mf value (default 3), the broker may flush BI buffers to disk for aged transaction ends
- Increasing -Mf has pros and cons: fewer BI writes, but a longer window in which committed transactions are not yet durable

24 BI Buffer Pool – Change rollback
Rollback processing reads notes back from the BI:
- One shared current input buffer
- Multiple private backout buffers

25 BI Buffer Pool – Change rollback
- Read I/O to find notes; write I/O when undoing
- Promon reports: BI reads, input buffer hits, output buffer hits, mod buffer hits, BO (backout) buffer hits
- If a note is still in the current output or modified buffer, only the note is copied, not the entire buffer

26 Tuning the BI Buffer Pool
Run a BIW, and watch Promon: 5. BI Log Activity.
- Empty buffer waits (all buffers full): increase -bibufs (can be done online), keep -aibufs >= -bibufs, and start with -bibufs 150
- Partial (forced) writes: -Mf expired (increase it if you are not risk-averse) or too many APWs (tune checkpoint processing)
- Busy buffer waits (buffer busy): OK
- Log force waits/writes: 2PC commit

27 Monitoring BI Activity & Performance Summary
Forward activity: total BI writes, records (notes) written, clusters closed
Undo: total BI reads, notes read, input buffer hits, output buffer hits, mod buffer hits, BO buffer hits
OK waits & writes: busy buffer waits, BIW writes
Bad waits & writes: empty buffer waits, partial writes, forced writes (tpflCt64, 2PC only), flushed at checkpoint, checkpoint duration (wait)

28 Checkpoint Processing

29 Checkpoint Processing
The steps, in order (sketched in code below):
1. Quiet the DB: database changes are halted; page writers continue
2. Flush bibufs: the output and modified buffers (may cause 1 partial write)
3. Scan the buffer pool: write buffers still on the checkpoint queue, and add dirty buffers to the checkpoint queue
4. Flush aibufs
5. Sync the file system (F/S sync system call): after this there is no more sync delay
6. Resume database activity
This is a "fuzzy" checkpoint: dirty buffers are only marked for checkpoint and put on the checkpoint queue, which avoids I/O during the checkpoint itself; hopefully the APWs flush these buffers before the next checkpoint.
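The same sequence as commented C pseudocode; every function is an invented placeholder for the corresponding step above:

```c
/* Hypothetical checkpoint sequence, for illustration only. */
extern void quiet_database(void);
extern void flush_bi_buffers(void);
extern void write_remaining_checkpoint_queue(void);
extern void queue_dirty_buffers_for_checkpoint(void);
extern void flush_ai_buffers(void);
extern void sync_file_system(void);   /* fdatasync() / FlushFileBuffers() */
extern void resume_database(void);

void checkpoint(void)
{
    quiet_database();            /* halt changes; page writers keep running */

    flush_bi_buffers();          /* output + modified BI buffers;
                                    may cause one partial BI write          */

    /* "Fuzzy" checkpoint: flush what is left over from the previous
     * checkpoint, then merely queue this checkpoint's dirty buffers
     * for the APWs to write later. */
    write_remaining_checkpoint_queue();
    queue_dirty_buffers_for_checkpoint();

    flush_ai_buffers();
    sync_file_system();
    resume_database();
}
```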

30 Promon Checkpoint Data
The Database Writes display shows one row per checkpoint with columns No., Time, CPT Q, Scan, APW Q, and Flushes (sample rows Nos. 24-27 appeared on the slide).
APW-specific activity:
- CPT Q: the number of data buffers APWs wrote from the checkpoint queue (queued by the previous checkpoint)
- Scan: the number of data buffers APWs wrote while scanning -B
- APW Q: the number of data buffers APWs wrote from the APW queue (dirty buffers added to the APW queue by -B LRU eviction); a busy APW Q is NOT GOOD, and may mean you need to increase -B or add an APW
- Dirty: the number of dirty (marked modified) blocks in -B

31 Promon Checkpoint Data
Flushes: the number of database blocks written during the checkpoint itself (buffers still marked from the previous checkpoint)
- A very costly operation: database updates are paused while they are written
- The AI/BI flushes arguably should be added to this count
- Avoid flushes with APWs and larger cluster sizes

32 Promon Checkpoint Data
----- New Columns -----
No.  Time      Duration  Sync Time
27   10:23:12  0.12      0.04
26   10:22:46  0.11      0.03
25   10:22:18
24   10:21:50  0.13
Duration: the time to process the checkpoint, including writing the checkpoint queue, the buffer pool scan, the BI/AI flush, and the F/S sync
Sync Time: the time in seconds taken by fdatasync() or FlushFileBuffers()
To reduce these: limit file system cache size and flush frequency, use faster disks for data files, or avoid the sync entirely with -directio (but that makes every write I/O more expensive)

33 Tuning Checkpoint Processing
Physical: truncate the BI and set its geometry (values in KB):
- -bi: cluster size in KB
- -biblocksize: before-image block size in KB, typically set to 8 or 16 KB
  proutil <db> -C truncate bi -biblocksize 8 -bi 8192
- Then pre-grow clusters (followed by a sync, since -r uses buffered I/O):
  proutil <db> -C bigrow 8 -r
Runtime: BI buffers (-bibufs) and a BIW

34 Summary: Recovery Subsystem
AI/BI buffers:
- No LRU replacement mechanism; database changes are recorded in order
- Forward processing causes BI write I/O; rollback may cause read I/O
- Backout buffers (BOB) help with rollback contention
Checkpoints: buffers flushed during checkpoint
Page writers: BIW/AIW processing, APW processing

35 Agenda
1. Database I/O Types
2. User Data I/O
3. Recovery Data I/O
4. Other I/O
5. Summary

36 Database Extend
Database (data) extend:
- The storage area is locked: no other extends can proceed
- Writes are performed 16K at a time
- Extends by 64 blocks or by the cluster size
Recovery extend (AI/BI):
- Acquires space from the file system using unbuffered writes
- BI grow happens after a truncate
Performance improvements in 11.3: file system interaction for extent create; BI extend, format & grow
Weigh the maintenance cost against performance, concurrency, and frequency.

37 Monitoring I/O With Promon R&D
Distinguish database accesses (logical) from file I/O: database writes vs. O/S writes.
2. Activity Displays: 1. Summary, 3. Buffer Cache, 4. Page Writers, 5. BI Log, 6. AI Log, 8. I/O Operations by Type, 9. I/O Operations by File
3. Other Displays: 1. Performance Indicators, 2. I/O Operations by Process, 4. Checkpoints, 5. I/O Operations by User by Table, 6. I/O Operations by User by Index
Note: the "I/O Operations by User" figures are NOT operating system I/Os; they are logical database operations.

38 Agenda
1. Database I/O Types
2. User Data I/O
3. Recovery Data I/O
4. Other I/O
5. Summary

39 Summary
I/O types: OpenEdge always uses the file system cache (no raw I/O); buffered vs. unbuffered I/O; user data files (.d) and recovery files (.ai, .bi, .tl)
Data and recovery I/O: the checkpoint process; page writers (APW, BIW, AIW)
Performance: monitor via Promon, VSTs, and OS tools; tuning tips

40 Questions?

41 www.progress.com/exchange-pug October 6–9, 2013 • Boston #PRGS13
Special low rate of $495 for PUG Challenge attendees with the code PUGAM And visit the Progress booth to learn more about the Progress App Dev Challenge!


43 File Write I/O for File Types
I/O always uses the file system cache (no raw I/O). Buffered vs. unbuffered I/O:
- Unbuffered I/O is considered durable after the write system call
- Buffered I/O requires a file system sync for durability
User data files: <db>_<area>.d (table, index, and LOB data)
- Changes are recorded in .bi/.ai for undo/redo purposes
- Updates use buffered I/O; this can be overridden to use the "unbuffered I/O" mechanism (-directio)
Recovery files: .ai, .bi, .tl (for recovery to work…)
- Updates are recorded using the "unbuffered I/O" mechanism
- Always written before user data (the WAL rule); must be durable on disk when written
- The file system must observe write ordering

44 Other parameters affecting I/O operations
-groupdelay (used when -Mf is 0): this is really group commit. The user waits the given number of milliseconds before writing its end note, so that several commits can share the same BI buffer write. With -Mf 0, each commit would otherwise immediately force the BI buffer to be written (see the sketch below).
-bwdelay: forces the BIW to delay the given number of seconds between each BI buffer pool scan.
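A hedged sketch of the group-commit idea in C; the helper names, the buffer bookkeeping, and the use of usleep() are all invented for this illustration:

```c
#include <unistd.h>   /* usleep() */

/* Hypothetical group commit: instead of flushing the BI buffer for
 * every commit (the -Mf 0 behavior), wait a few milliseconds so that
 * other transactions' end notes land in the same buffer, then do one
 * shared write. */
extern void append_end_note(long txn_id);
extern int  bi_buffer_already_flushed(long txn_id);
extern void flush_bi_buffer(void);

void commit_with_groupdelay(long txn_id, int groupdelay_ms)
{
    append_end_note(txn_id);

    /* Give other committers a chance to add their end notes. */
    usleep((useconds_t)groupdelay_ms * 1000);

    /* Someone else may have flushed the shared buffer meanwhile. */
    if (!bi_buffer_already_flushed(txn_id))
        flush_bi_buffer();       /* one write covers several commits */
}
```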

45 Recognizing You Have A Problem
Things are sluggish? Improve cache utilization and avoid the causes of I/O.
Symptoms to check:
- Physical disk activity; a file system cache that is too large or too slow
- Users blocked on I/O (Promon "Blocked Users" screen)
- Checkpoint duration > 0.50 seconds
- Excessive I/O: table scans, the buffer pool hit ratio, the recovery subsystem (reads vs. writes), APW/BIW/user writes

