VERITAS Volume Manager Technical Talk
S201: Best Practices
Mike Root, Sr. Volume Manager Engineer
Sean Derrington, Sr. Group Manager, Product Management
VERITAS Software
VxVM: Host-Based Storage Virtualization
- Robust
- Flexible
- High performing
A solid foundation for advanced functionality:
- Snapshots
- Replication
- Clusterable
How does VxVM do it? A robust object layout.
VxVM Internals Overview
- VxVM objects: what they are, how they affect IO
- Private region: structure and usage
- New in Version 4.0: configuration backup and restore, logging, FlashSnap, volume sets
VxVM Objects
Objects shown: Replication Volume Group, Replication Link, Volume Set, Volume, SRL, DCO Plex, Logonly Plex, Snapshot, Subcache, Subdisk, Logsubdisk, DCO Volume, Disk Media, Cache, Disk Access, Info, Cache Volume, Dynamic Multipathing, Storage Pool.
Notes:
- This is the vxconfigd view of things, the way the user would manage the objects.
- In the kernel the picture changes; the kernel view can be seen with vxkprint. In the kernel, Disk Media and Disk Access become just a voldisk (or disk), and a logonly plex becomes a separate volume with the logonly plex as its mirror: the volume name would be vol[log] for DRL or vol[dcm] for the VVR DCM.
- Start with the objects that are real objects; DMP isn't an object (it is a separate driver).
- Slide color key: grey with black letters is VVR; blue is devices created in /dev/vx/dsk; light blue is the volume that everybody knows; dark blue is the volume set, new in 4.0 (up to 256 volumes per volume set, going to 4096 in 5.0); orange is the traditional objects that everybody should know, which are also the main IO path; green is FlashSnap; yellow is Intelligent Storage Provisioning (ISP); the brown round-edged and dark grey objects are specialized objects made from real objects but used for a specific purpose.
- The DCO is a connector object: it connects the snapshot and DCO volume to the data volume. A snapshot points to another snapshot object, so both volumes know which is a snapshot of which.
- For ISP: the Info object holds the rules the ISP volumes must follow, and storage pools group the disks in the diskgroup that ISP can use.
- A logonly plex is a regular plex, but without a data subdisk; a logsubdisk is the subdisk used for DRL or DCM.
- The others (SRL, DCO volume, cache volume) are all real volumes associated with another object for a specific job. They can be managed like volumes, but they are not directly accessed by the user application; they are used by the specific feature (VVR, FlashSnap, space-optimized snapshots).
IO Path Basics
- Users direct IO requests at a volume or volume set.
- vxconfigd creates devices in /dev/vx/[r]dsk/; the device number identifies the initial target object for each user request.
- VxVM IO is object based: vxconfigd loads the objects into the kernel, and the volume is found in a hash from the device number.
- Each object performs or requests the actions required at its level.
Notes:
- The top objects (volumes/vsets) are found from a hash for quick lookup; the IO is then passed to the object.
- The object decides what to do with the IO. If the volume has DRL, the IO is given to DRL to be done first; after DRL is done, the IO comes back to the volume and can then be sent to the plexes. If the volume is part of an RVG, the IO is sent to the RVG, which decides to write to the SRL.
- Each object knows how best to resolve the request: the volume picks which plex to read from, the plex picks which subdisk(s) to read from, and the subdisk knows where on the disk to read.
- Each object keeps stats and a trace.
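The dispatch described above can be sketched in a few lines. This is an illustrative model only, not VxVM source: the class names, the single-subdisk plexes, and the returned operation tuples are assumptions chosen to show how a request is looked up by device number and then handled level by level (DRL first, then all plexes).

```python
# Illustrative sketch (NOT VxVM source): object-based IO dispatch.
# All names and structures here are assumptions for illustration.

class Subdisk:
    def __init__(self, name, disk_offset):
        self.name = name
        self.disk_offset = disk_offset

    def write(self, block):
        # A real subdisk maps the volume block to a disk offset and
        # issues the IO (ultimately through DMP to a physical path).
        return (self.name, self.disk_offset + block)

class Plex:
    def __init__(self, name, subdisks):
        self.name = name
        self.subdisks = subdisks

    def write(self, block):
        # A real plex picks the subdisk(s) covering the block; this
        # sketch keeps one subdisk per plex for simplicity.
        return self.subdisks[0].write(block)

class Volume:
    def __init__(self, name, plexes, has_drl=False):
        self.name = name
        self.plexes = plexes
        self.has_drl = has_drl

    def write(self, block):
        ops = []
        if self.has_drl:
            ops.append(("drl", block))   # the DRL write happens first
        for plex in self.plexes:         # then all plexes, in parallel
            ops.append(plex.write(block))
        return ops

# The top-level volume is found from the device number via a hash.
device_table = {
    100: Volume("my_vol",
                [Plex("my_vol-01", [Subdisk("d1-01", 0)]),
                 Plex("my_vol-02", [Subdisk("d2-01", 0)])],
                has_drl=True),
}

print(device_table[100].write(7))
# [('drl', 7), ('d1-01', 7), ('d2-01', 7)]
```

Each object decides only what happens at its own level, which is why new object types (RVGs, volume sets, caches) can slot into the same path.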
Object IO Path Example
Application write → my_vol and my_vol[log]; plexes my_vol-01 through my_vol-04; subdisks d1-01, d2-01, d3-01, d4-01; disks d1 through d6; Dynamic Multipathing below.
Notes:
- Mention how the log subdisk is made into a volume in the kernel with [drl], and how the dm/da becomes a disk in the kernel. These kernel components can be seen with /etc/vx/vxkprint.
- Each object does its own IO: the volume decides to send to the DRL first; when DRL is done, it sends to both plexes at the same time. If the plex is striped, the plex can send to both subdisks at the same time. The disk sends the IO down to DMP, which then picks the correct path.
Object Level IO Performance
vxstat displays per-object IO statistics:

# vxstat vol vol-01 vol-02 c2t0d0s2-01
             OPERATIONS    BLOCKS    AVG TIME(ms)
TYP NAME     READ  WRITE   READ WRITE   READ WRITE
vol my_vol
pl  my_vol
pl  my_vol
sd  c2t0d0s

- Identify "hotspots" at the object level.
- Resolve by relocating subdisks from busy disks.
Notes:
- Stats can be collected even on the DRL.
- Resolve hot spots by moving a subdisk to a new disk, or possibly by splitting a subdisk and moving part of it to a new disk.
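A hotspot search over this kind of output is mechanical enough to script. The sketch below is hypothetical: it assumes a simplified, whitespace-separated record layout (type, name, read ops, write ops, then remaining columns), since the real vxstat column values are not shown on the slide.

```python
# Hypothetical sketch: find the busiest subdisk in vxstat-style output.
# The column layout below is an assumption for illustration, not the
# exact vxstat format.
sample = """\
sd d1-01 120 3400 960 27200 1.2 4.8
sd d2-01 110 900 880 7200 1.1 3.9
sd d3-01 115 3300 920 26400 1.3 4.7
"""

def hottest_subdisk(text):
    """Return (name, read_ops + write_ops) of the busiest subdisk."""
    best, best_ops = None, -1
    for line in text.splitlines():
        typ, name, rd, wr, *_ = line.split()
        if typ != "sd":
            continue  # only rank subdisk records
        ops = int(rd) + int(wr)
        if ops > best_ops:
            best, best_ops = name, ops
    return best, best_ops

print(hottest_subdisk(sample))
# ('d1-01', 3520)
```

In practice the busiest subdisk is the relocation candidate the slide describes: move it (or part of it, after a split) to a less loaded disk.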
Cluster Volume Manager IO Path
- Cluster Volume Manager: shared diskgroups; multiple servers reading/writing the same volume (e.g., my_vol).
- Normal IO: each node's VxVM instance makes IO requests directly to disks; no inter-node messages required.
- Certain operations require messages between nodes:
  - Administrative IO (e.g., volume relayout)
  - Error recovery
  - FlashSnap bitmap updates
  - VVR writes to the SRL
Objects and IO Error Recovery
Issue: plexes must be kept consistent when IO errors occur.
vxio:
- Marks the disk with the FAILING flag (prevents allocation of new objects).
- Sends reads to a different plex.
- Changes the plex kstate to DETACH and updates the klog.
- Writes are retried.
vxconfigd:
- Detaches the plex containing the failing device. (A detached plex remains associated with its volume, but does not participate in IO.)
Notes:
- The FAILING flag means at least one subdisk on this disk had an error: don't use this disk for new allocations, and don't relayout or move volumes to it.
- The private header is rewritten to check whether the disk is still usable.
- The kernel kstate and the vxconfigd state may differ.
Volume Recovery After a System Crash
Issue: VxVM must ensure volumes are not left in indeterminate states (e.g., inconsistent mirrors).
Goal: start volumes ASAP without loss of consistency.
Mechanism: read-writeback mode.
- Every block read by a user is rewritten to all plexes.
- All plex contents are made consistent in the background.
- This eliminates the possibility of reading different data from different plexes during recovery.
Used:
- When starting a mirrored volume
- When a CVM node fails
- With DRL, only "in flight" regions need be made consistent; DRL can make the recovery faster.
Notes:
- Read-writeback mode can only be seen in the kernel, with vxkprint. Either the kflag or the sflag is set to rwback when read-writeback mode is enabled:

Mirrorvol vol: rid= assoc=0.0 update_tid=0.1052 len= poolid=0
  rwbackoffset= cdsrecover=1/1 (syncing)
  ap_recover_seqno: 0 ap_recover_seqno_done: 0
  kflag=(enabled|rdwr|round-robin|except-det-sparse|writeback|rwback)
  sflag=(rwback)
  guid = { c-1dd2-11b2-a fff5c0}
  vvr_tag = 0 proxy rid = 0.0
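The read-writeback behavior above reduces to a simple invariant: once any block has been read through the volume, all mirrors agree on that block. A minimal sketch, with an assumed two-plex in-memory model (not VxVM code):

```python
# Illustrative sketch of read-writeback mode: during recovery, every
# block a user reads is rewritten to all plexes, so mirrors can never
# return different data for a block once it has been read.

class MirroredVolume:
    def __init__(self, nplexes, nblocks):
        # After a crash, plexes may disagree about in-flight writes.
        self.plexes = [[None] * nblocks for _ in range(nplexes)]

    def read_writeback(self, block):
        data = self.plexes[0][block]   # read from one plex...
        for plex in self.plexes:       # ...and write it back to all
            plex[block] = data
        return data

vol = MirroredVolume(nplexes=2, nblocks=4)
vol.plexes[0][2] = "A"   # simulate inconsistent mirrors after a crash
vol.plexes[1][2] = "B"

vol.read_writeback(2)
print(vol.plexes[0][2], vol.plexes[1][2])
# A A
```

Which plex "wins" is arbitrary; the point is consistency, not choosing the newest copy, since either copy is a valid crash outcome. A background pass over the remaining blocks completes the resync.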
VxVM Internals Overview
- VxVM objects: what they are, how they affect IO
- Private region: structure and usage
- New in Version 4.0: configuration backup and restore, logging, FlashSnap, volume sets
Private Region Overview
- A disk must be initialized before VxVM can use it; initialization identifies the disk to VxVM with a "unique" diskid.
- A disk initialized for VxVM use has:
  - Private region: contains VxVM metadata
  - Public region: application data storage
- Object information is stored in the private region.
- Configuration information is saved on selected disks; the number of disks is based on the total disks in the diskgroup, with enabled config regions spread across controllers and enclosures. Fewer config disks means faster configuration changes.
Notes:
- Type describes how vxconfigd accesses the disk; format describes how things are laid out on disk.
VxVM Disk Types/Formats

Type/Format  Description
sliced       Private region and public region on different partitions
simple       Private region and public region on the same partition
nopriv       Public region but no private region (used for RAM disks and encapsulation)
cdsdisk      Formatted for platform portability of diskgroups (new in 4.x)

Starting with 4.0, always use the cdsdisk format ("auto:cdsdisk").
Cross-Platform Data Sharing (CDS) Disk Format
New in 4.x.
- Allows diskgroups to be moved between different platforms.
- Allows disks to be recognized by all supported UNIX platforms.
- Private regions are aligned on 8KB boundaries.
- A CDS diskgroup can be imported on any supported UNIX platform, regardless of platform endian format.
- Convert older diskgroup formats with vxcdsconvert.
vxconfigd View of the Private Region

# vxdisk list c1t98d0s2
Device:    c1t98d0s2
devicetag: c1t98d0
type:      auto
hostid:    anthrax
disk:      name= id= anthrax
group:     name=tcrundg id= anthrax
info:      format=sliced,privoffset=1,pubslice=4,privslice=3
flags:     online ready private autoconfig autoimport
pubpaths:  block=/dev/vx/dmp/c1t98d0s4 char=/dev/vx/rdmp/c1t98d0s4
privpaths: block=/dev/vx/dmp/c1t98d0s3 char=/dev/vx/rdmp/c1t98d0s3
version:   2.1
iosize:    min=512 (bytes) max=2048 (blocks)
public:    slice=4 offset=0 len= disk_offset=7182
private:   slice=3 offset=1 len=3334 disk_offset=3591
update:    time= seqno=0.23
ssb:       actual_seqno=0.0
headers:
configs:   count=1 len=2431
logs:      count=1 len=368
Defined regions:
  config priv [000231]: copy=01 offset= enabled
  config priv [002200]: copy=01 offset= enabled
  log    priv [000368]: copy=01 offset= enabled
Multipathing information:
  numpaths: 1
  c1t98d0s state=enabled

Private Region Header
Notes:
- The private region header holds all the information that uniquely identifies the disk and allows the disk to be associated with valid information in the configuration records of the config region: hostid, dgid, diskid, update time/sequence number, and serial-split-brain data.
Private Region Internals
Config:
- vxconfigd stores the persistent object information here: layout/size of volumes, associations between objects, and the diskgroup version.
Klog:
- The kernel logs changes; vxconfigd discovers what changed.
- vxconfigd doesn't need to be running to do IO.
- VCS uses vxconfigd to identify when things change.

# /etc/vx/diag.d/vxprivutil dumplog /dev/rdsk/c1t0d0s2
LOG #01
BLOCK 0: KLOG 0 : COMMIT tid=0.1048
BLOCK 0: KLOG 1 : DIRTY rid=0.1029
Managing the VxVM Private Region Directly: vxprivutil
- For advanced users and customer support; located in /etc/vx/diag.d.
- Functions:
  - Set private region header attributes (e.g., hostid, dgname, ...)
  - View a diskgroup before importing it:
    vxprivutil dumpconfig /dev/rdsk/c3t2d0s2 | vxprint -D -
  - View klog contents:
    vxprivutil dumplog /dev/rdsk/c3t2d0s2
Notes:
- Viewing the private region from the disk can be helpful if the disk has a problem. It can be used to restore the diskgroup, but 4.0 has a better way: we used to use vxprint -hmtqpl for this; now use configuration backup and restore.
- The complete list of attributes that can be set in 4.0: diskid, hostid, dgid, movedgid, dg_name, dgname, shared, autoimport, dgmove, ssbid. Some (hostid, dgid, dg_name, shared) would need to be set on every disk in the diskgroup to be useful; diskid must be unique per disk; others (ssbid, movedgid, dgmove) can be different for each disk in the diskgroup.
VxVM Internals Overview
- VxVM objects: what they are, how they affect IO
- Private region: structure and usage
- New in Version 4.0: configuration backup and restore, logging, FlashSnap, volume sets
rootdg Is No Longer Required
New in 4.x.
- Customers really like that rootdg is optional now.
- Reserved diskgroup names: bootdg, defaultdg, nodg.
- rootvol can be in any diskgroup.
- The -g option is now required for most commands; use vxdctl defaultdg <dgname> to avoid typing -g <dgname>.
Configuration Backup and Restore
New in 4.x.
- Automatically saves a current copy of the private region: configuration changes save a backup in /etc/vx/cbr/bk containing a binary copy of the config region, disk information, and vxprint -m output.
- Administrative commands restore the private region and diskgroup configuration.
- Restore only to the same hardware, identified by a matching hardware id (e.g., serial number, WWN, LUN id).
- Operation in a cluster: shared diskgroup backups are done on the master; private diskgroup backups are done where the diskgroup is imported.
Notes:
- The previous five copies are actually saved in files.
- The user could run vxprivutil against the binary copy of the private region. Theoretically, the user could edit the vxprint -m output and concatenate the files back together.
Configuration Backup and Restore Commands
- vxconfigbackupd automatically backs up the configuration after every change.
- Manual backup: vxconfigbackup <diskgroup name>
- Prepare for restore with vxconfigrestore and either -n or -p:
  -n: don't update the private region header
  -p: the private region header may be corrected to match the backup
- The restored configuration is held in memory with volumes in read-only mode; view the configuration with vxprint before committing it.
- Commit the restoration with vxconfigrestore -c (the private region is completely updated), or discard the changes with vxconfigrestore -d.
Notes:
- You can restore to new hardware that isn't initialized.
Command and Transaction Logging
New in 4.x.
- Command log: history of VxVM commands run on the system.
- Transaction log: history of operations performed by vxconfigd.
- Useful for auditing actions taken by administrators and by client processes.
- Logs are kept in /etc/vx/log, along with the existing GUI log.
- The size and number of logs can be set by the administrator.
Notes:
- When the GUI issues a CLI command to do an operation, both the GUI log and the command log record that the command was executed.
Command Logging: vxcmdlog
New in 4.x.

# vxcmdlog -l
Values for Control Variables:
  Command Logging is currently ON
  Maximum number of log files = 5
  Maximum size for a log file = bytes

Log format (/etc/vx/log/cmdlog):
# 32155, 1535, Thu Apr 21 13:15:
  /usr/sbin/vxprint
Transaction Logging: vxtranslog
New in 4.x.

# vxtranslog -l
Values for Control Variables:
  Transaction Logging is currently ON
  Query Logging is currently ON
  Maximum number of log files = 1
  Maximum size for a log file = bytes

Log format (/etc/vx/log/translog):
Thu Apr 21 13:15:
  Clid = 32155, PID = 1535, Part = 0, Status = 0, Abort Reason = 0
  DG_GETCFG_ALL 0x5de49f
  DISCONNECT <no request data>
DRL and FlashSnap
Both are bitmaps.
- DRL minimizes recovery time after a system crash: it tracks writes that are active on the volume, so only the active writes need to be recovered.
- FlashSnap minimizes the data copied after a disk/array/cable failure or a user error.
Dirty Region Log (DRL)
- A bitmap of regions where writes may be in progress: written before writing the data, cleared lazily.
- Only used if the volume has at least two active mirrors.
- Implementation: a log subdisk; may be combined with the FlashSnap bitmaps in V4.0.
- The limited number of concurrent dirty bits bounds recovery time: a per-volume limit (fixed at 256) and a per-system limit (tunable).
Notes:
- Only the active writes need to be recovered.
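The bounded-dirty-bits behavior above can be sketched compactly. This is an illustrative model, not VxVM code: region size, the eviction policy, and the data structures are assumptions chosen to show why recovery time stays bounded.

```python
# Illustrative sketch of a dirty region log: a write marks its region
# dirty BEFORE the data IO; bits are cleared lazily; the number of
# concurrently dirty regions is capped, which bounds crash recovery.

class DirtyRegionLog:
    def __init__(self, region_size, max_dirty=256):
        self.region_size = region_size
        self.max_dirty = max_dirty
        self.dirty = set()   # region numbers that may have writes in flight

    def mark(self, block):
        """Log a write to `block`; must complete before the data write."""
        region = block // self.region_size
        if region not in self.dirty and len(self.dirty) >= self.max_dirty:
            self.dirty.pop()  # lazily clean some idle region to stay bounded
        self.dirty.add(region)
        return region

    def regions_to_recover(self):
        # After a crash, only the dirty regions need to be resynchronized.
        return sorted(self.dirty)

drl = DirtyRegionLog(region_size=1024, max_dirty=4)
for blk in (0, 100, 5000, 9000):
    drl.mark(blk)
print(drl.regions_to_recover())
# [0, 4, 8]
```

Because at most `max_dirty` regions can be dirty at once, a crash at any instant leaves at most that many regions to resynchronize, regardless of volume size.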
FlashSnap
- A collection of bitmaps: one bitmap for each snapshot, and one bitmap for all detached plexes.
- Bitmaps are enabled after events: snapshots (point-in-time copies) of user data, or a plex detach (FastResync, also called FMR).
- Bitmaps exist on both the original and the snapshot volumes, enabling refresh from A to B and restore from B to A.
FlashSnap in 4.0: Instant Full-Size Snapshots
New in 4.x.
- Instant full-size snapshots: the snapshot is created before copying any data; a bitmap identifies the copied data, and copy-on-write is performed for uncopied regions.
- Space-optimized snapshots: a full copy of the data is never needed; changed data is stored in a smaller cache volume.
- Snapshot hierarchy: restore B from C, where B and C are snapshots of A. Snapshots can also be cascaded.
- The DRL bitmap is part of the FlashSnap bitmap set; one write updates both bitmaps.
Notes:
- Before 4.0, the number of snapshots was fixed at 31. In 4.0 the number of bitmaps, and hence the number of snapshots, can be configured by the administrator.
- The new vxsnap command creates a volume and takes the snapshots directly; there is no more need for the plex detach.
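The copy-on-write mechanism behind instant snapshots can be sketched in a few lines. This is an illustrative model under assumed names (not VxVM internals): a dict stands in for the volume, and the `copied` map plays the role of the bitmap plus the copied regions.

```python
# Illustrative sketch of an instant (copy-on-write) snapshot: the
# snapshot exists immediately; a per-region record tracks which
# regions have been preserved, and an uncopied region is copied from
# the origin on the first write that would overwrite it.

class InstantSnapshot:
    def __init__(self, origin):
        self.origin = origin
        self.copied = {}   # region -> data as of snapshot creation

    def on_origin_write(self, region, new_data):
        if region not in self.copied:
            # copy-on-write: preserve the old contents first
            self.copied[region] = self.origin[region]
        self.origin[region] = new_data

    def read(self, region):
        # preserved regions come from the snapshot store,
        # untouched regions still read through to the origin
        return self.copied.get(region, self.origin[region])

origin = {0: "a", 1: "b", 2: "c"}
snap = InstantSnapshot(origin)   # usable immediately, nothing copied yet
snap.on_origin_write(1, "B")     # first overwrite triggers the copy

print(origin[1], snap.read(1), snap.read(2))
# B b c
```

A space-optimized snapshot works the same way, except the preserved regions live in a smaller cache volume rather than a full-size copy.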
DRL vs FlashSnap (IO Path)
Example: my_vol with plexes my_vol-01, my_vol-02, my_vol-03, a DRL bitmap, a FlashSnap bitmap, and the snapshot SNAP-my_vol.
Notes:
- The DRL bitmap is updated while a write is in progress and there are at least two active mirrors; when the write completes, the DRL bit is (or can be) cleared.
- The FlashSnap bitmap is only updated after an event (snapshot/detach).
- After a snapshot, the third mirror is broken off into a new snapshot volume. The snapshot is not updated, and the FlashSnap bitmap kicks in. The DRL bitmap behaves the same, since there are still two active mirrors.
- At the end, the FlashSnap bitmap remembers the dirty bit from the time of the snapshot; the DRL bit forgets, since it is cleared when the write completes.
Capacity Planning for Snapshots
Issue: what percentage of a volume's blocks is written during the snapshot's life?
vxtrace can be used to record all VxVM IO:

# vxtrace -l -g my_dg my_vol
START write vol my_vol dg mydg op 0 block len 48
START write vol my_vol dg mydg op 0 block len 36
START write vol my_vol dg mydg op 0 block len 29
END write vol my_vol dg mydg op 0 block len 48 time 1
START write vol my_vol dg mydg op 0 block len 8
END write vol my_vol dg mydg op 0 block len 29 time 2
END write vol my_vol dg mydg op 0 block len 36 time 3
END write vol my_vol dg mydg op 0 block len 8 time 2

Analyze the vxtrace output to determine the percentage of the volume written during the vxtrace run.
Notes:
- By default the output is saved to a binary file with vxtrace -d foofile; the binary file can be read back with vxtrace -f foofile.
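The analysis step is a union of written block ranges divided by the volume size. The sketch below is hypothetical: it assumes trace records carry explicit `block <n> len <n>` fields (the sample output above has the block numbers elided), so the record format here is an illustration, not the exact vxtrace layout.

```python
# Hypothetical sketch: estimate what fraction of a volume was written
# during a trace window, from vxtrace-style START write records.
# The "block <n> len <n>" field layout is an assumption.

def percent_written(trace_lines, volume_len):
    """Union all written block ranges; return percentage of the volume."""
    written = set()
    for line in trace_lines:
        fields = line.split()
        if fields[:2] != ["START", "write"]:
            continue  # only count initiated writes once
        block = int(fields[fields.index("block") + 1])
        length = int(fields[fields.index("len") + 1])
        written.update(range(block, block + length))  # overlaps collapse
    return 100.0 * len(written) / volume_len

trace = [
    "START write vol my_vol dg mydg op 0 block 0 len 48",
    "START write vol my_vol dg mydg op 0 block 32 len 36",  # overlaps above
    "START write vol my_vol dg mydg op 0 block 500 len 32",
]
print(percent_written(trace, volume_len=1000))
# 10.0
```

Using a set of blocks keeps the arithmetic honest when writes overlap (blocks 0-47 and 32-67 together touch 68 blocks, not 84), which is exactly the number a space-optimized snapshot's cache volume must be sized for.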
Volume Sets
New in 4.x.
- A group of volumes of different types (mirrored, striped, ...).
- VxFS uses private ioctls to do IO to the volume set; VxVM commands manage the individual volumes.
- Enabling technology for the VxFS Multi-Volume File System: separate file system metadata from user data, allocate files on the "right" type of storage, and relocate files based on changing conditions.
Volume Sets (continued)
New in 4.x.
- Up to 256 volumes of any type; all volumes made from disks in a single diskgroup.
- Individual volumes don't appear in /dev/vx/...
Example:

# vxvset -g homedg make HomedirSet MirrorVol
# vxvset -g homedg addvol HomedirSet RAIDVol1
# vxvset -g homedg list HomedirSet
VOLUME     INDEX  LENGTH  STATE   CONTEXT
MirrorVol                 ACTIVE  -
RAID5Vol                  ACTIVE  -

File systems are created on volume sets:

# mkfs -F vxfs /dev/vx/rdsk/homedg/HomedirSet
# mount -F vxfs /dev/vx/dsk/homedg/HomedirSet /home

Notes:
- FlashSnap can snapshot all the volumes in a volume set to make a consistent copy.
Conclusion
- VxVM is object based: robust and flexible, with visibility into performance at all levels.
- Private region: diskgroups are self-describing, transportable between UNIX platforms, and accessible to third-party developers.
- New features in 4.0: configuration backup and restore, command and transaction logging, instant snapshots, space-optimized snapshots, volume sets.
Next Steps
- New Features in Storage Foundation 4.0 and 4.1 (S120R): Thursday, April 28, 2005, 10:15 am - 11:15 am
- VERITAS Storage Foundation Roadmap and Future Directions (S143): Wednesday, April 27, 2005, 8:00 am - 9:00 am
- Strategies for Implementing Tiered Storage (S192)
- Tuning Dynamic Multipathing for Maximum Performance and Availability (S132R): Thursday, April 28, 2005, 10:30 am - 11:30 am
Remember to fill out your session surveys!
Questions & Answers
Mike Root