
1 © 2006 IBM Corporation, ibm.com/redbooks, International Technical Support Organization
GS Group meeting, 16.09.2008
High Availability considerations
Ellen Dreyer Andersen, Certified IT Specialist, IBM Danmark A/S

2 The Myth of Nines

Availability %    Downtime per year
98%               7.30 days
99%               3.65 days
99.5%             1.83 days
99.9%             8.76 hours
99.99%            52.6 min
99.999%           5.26 min
99.9999%          31.5 seconds

Reliability is not the same as availability!
Engineers and marketing people like to use this table because it looks impressive. The 'Myth of Nines' is caused by the following false assumptions:
▪ When the computer is operating, so is the user's business
▪ Ten 1-minute outages in one day have the same effect on the user as one 10-minute outage in a day
▪ Planned maintenance does not need to be included in the calculation
In the real world, availability is defined in a Service Level Agreement (SLA).
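The downtime figures in the table follow directly from the availability percentage: downtime per year is (1 - availability) multiplied by the length of a year. The Python sketch below reproduces that arithmetic. It assumes a 365-day year, and the function names are illustrative, not part of the presentation.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day (non-leap) year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


def format_downtime(seconds: float) -> str:
    """Express a duration in the largest unit that keeps the value >= 1."""
    for unit, size in (("days", 86400), ("hours", 3600), ("min", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    # The availability levels listed on the slide.
    for percent in (98, 99, 99.5, 99.9, 99.99, 99.999, 99.9999):
        downtime = downtime_seconds_per_year(percent)
        print(f"{percent}% -> {format_downtime(downtime)}")
```

The calculation only totals downtime. It says nothing about how the outages are distributed, whether they were planned, or whether the user's business was actually able to operate, which is exactly what the false assumptions above leave out.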

3 Downtime
Downtime refers to a period of time, or a percentage of a timespan, during which a machine or system (usually a computer server) is offline or not functioning, usually as a result of either system failure (such as a crash) or routine maintenance.
Reliability is not the same as availability!
[Figure: Causes of Downtime and Solution Required (Disaster Recovery; High Availability / Continuous Operations). Source: IBM HA Presentation, Eric Hess, April 2001]
* 'HA' generally refers to solutions that provide BOTH recovery and availability. Not all technologies provide a solution for BOTH… iTera 5.0 HA does.

4 Business Continuity is:
▪ Capability of a business to withstand outages and operate mission critical services normally and without interruption per a pre-defined service level agreement
– Solution must address data, operational environment, applications, the application hosting environment, and the end user interface
– Requires a collection of services, software, hardware, and procedures to be selected, described in a documented plan, implemented, and practiced regularly
▪ Includes both Disaster Recovery (DR) and High Availability (HA)
– DR addresses the set of resources, plans, services and procedures to recover and resume mission critical applications at a remote site in the event of a disaster
– HA defined as the ability to withstand all outages (planned, unplanned, and disasters) and to provide continuous processing for all mission critical applications

5 Application Resilience
▪ Fully transparent: Full resilience with automatic restart & transparent failover
– Users repositioned to last committed transaction
– No data loss, no sign-on required, no perceived loss of server; only delay in response
▪ Semi-transparent: Automatic application restart & recovery to last transaction boundary
– The resilient data & the application restart point match exactly
▪ Semi-automatic: Automatic application restart & recovery to some architected application "restart" point
– Normally consistent with state of data, but user may have to manually match application to position of data
▪ Basic application failover: Automatic application restart after outage
– User manually repositions within application
▪ No application recovery: Users manually restart application with resilient data
– User determines where to resume work
[Figure labels: Single Server; Data resiliency; HA enabled applications and iSeries Clusters. User reactions: "Start over. Where's all my work?"; "Checkpoint restart. Not too bad."; "Huh? Did something happen?"]
iSeries Clusters: Combine with Data Resilience for complete solution

6 Major Business Continuity Problem Sets
▪ HA for planned outages
▪ HA for unplanned outages
▪ Disaster recovery
▪ Workload balancing
▪ Backup window reduction
Starting point is to fully identify the set of availability problems that you are attempting to address

7 What is a Service Level Agreement (SLA)?
▪ General:
– Contractual service commitment.
– A document that describes the minimum performance criteria a provider promises to meet while delivering a service.
– Typically also sets out the remedial action and any penalties that will take effect if performance falls below the promised standard.
▪ Relative to Availability:
– Commitment to the business describing the level of availability for IT services that support critical business solutions.
– Addresses when IT services are expected to be fully operational, when they may be running degraded, and when they won't be available.
– Driven primarily by the importance of IT services in providing business solutions, cost factors, and realism.
▪ Many factors involved
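One way to make an availability commitment measurable is to compute achieved availability over the agreed service window from a log of outages. The Python sketch below is only an illustration under stated assumptions: the `Outage` record, the 30-day window and the outage durations are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Outage:
    minutes: float   # length of the outage
    planned: bool    # True for scheduled maintenance


def achieved_availability(service_minutes: float,
                          outages: Iterable[Outage],
                          count_planned: bool = True) -> float:
    """Return availability in percent over the agreed service window.

    With count_planned=False, planned maintenance is ignored, which is
    the optimistic view behind the 'Myth of Nines'.
    """
    down = sum(o.minutes for o in outages if count_planned or not o.planned)
    return 100.0 * (service_minutes - down) / service_minutes


if __name__ == "__main__":
    window = 30 * 24 * 60  # hypothetical 30-day, around-the-clock service window
    log = [Outage(10, planned=False), Outage(240, planned=True)]
    print(f"unplanned only:    {achieved_availability(window, log, False):.3f}%")
    print(f"including planned: {achieved_availability(window, log, True):.3f}%")
```

Whether planned maintenance counts against the commitment is something the SLA itself has to state. The two figures printed above show how much the answer depends on that choice.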

8 Detailed Availability Criteria
[Figure: criteria plotted on a scale from 0 to 7 (Less to More) against low / medium / high]
Generalizations (e.g., categorizing into major problem sets) can distort the answer. Should use as a guide to potentially rule out or select solution types.
In practice, consider all major supporting factors such as:
▪ Up time requirements
▪ Recovery point objective (RPO)
▪ Recovery time objective (RTO)
▪ Resilience requirements
▪ Outage type coverage
▪ Concurrent access
▪ Geographic dispersion
▪ Tolerance for end user disruption
▪ Cost factors
▪ Downtime window availability
▪ Services and support

9 The Fundamentals
▪ Data Replication alone is not sufficient for HA
▪ Clustering, automation and application resiliency complete the equation
[Figure layers:]
– Data Protection: RAID 5 & Mirroring
– Transaction Integrity: Journaling
– Data Resiliency: IASPs, Replication
– Application Resiliency
– Clustering

10 Clustering
▪ A property of the Operating System
▪ Provides the logical connections between resilient data groups
▪ Can enable the automation of physical and logical switching
▪ Can enable a resilient application to be "switched", activated and repositioned to a defined state
▪ Enables the automatic sequencing of events that bring the user, application and data to a coherent production state automatically
▪ Application design is the primary limiting factor
[Diagram layers:]
– Application Resiliency: High availability cluster enabled applications
– Cluster Management: iSeries Navigator or partner products
– Cluster Resource Services: Base OS/400 - i5/OS cluster functions from IBM APIs
– Data Resiliency: Replication and Switched IASPs
– OS/400 - i5/OS Option 41 – HA Switchable Resources
[Diagram callouts:]
▪ Heart beating
▪ IP Address Takeover
▪ Reliable internal cluster communications
▪ Switchover administration
▪ Distributed activities

11 Data Resilience Technologies: Detailed Attributes of Solutions

12 Data Resilience Technologies
▪ Logical Replication
– Business partner software product
▪ Switchable Device
– Switchable IASPs
▪ Operating System Storage Management based Replication
– Cross-Site Mirroring (XSM) with Geographic Mirroring
▪ Storage Server based Replication
– TotalStorage PPRC used with iSeries Copy Services toolkit
– TotalStorage PPRC used with SAN Load Source

13 Logical Replication
▪ Second copy of data is generated logically identical to first
▪ Replication done on object basis (file, member, data area, program, etc.) near real-time
▪ Normally done via a business partner software product

14 Logical Replication
▪ Most widely deployed data resiliency topology for iSeries
– Typically deployed via an HA Business Partner solution package
– Replication done on object basis (file, data area, program, etc.) near real-time
    – Done at the lowest unit of change for the object, e.g. record level for database files
    – Otherwise, done on entire object when change detected by replication software
    – OS/400 Remote Journaling provides efficient, reliable transport mechanism
▪ Benefits:
– Rapid activation of production environment on backup server via role-swap operation
– Replicated data can be concurrently accessed for backups or other read-only apps
– Minimal recovery is needed when switching over to the backup copy
▪ Considerations:
– Complexity of setup and maintenance
– Modification of "live" copies of objects on backup server
– Lag time between changes on source being available on backup server
– Consistency between journaled and non-journaled objects

15 Logical replication via remote journaling

16 Replication of ALL critical data and object types
▪ Data (libraries, files)
▪ Data areas
▪ Data queues
▪ IFS (real time journal based mirroring)
▪ MQ/WebSphere
▪ User profiles (real time)
▪ Spool files and output queues
▪ Program objects
▪ Controllers, lines and devices
▪ IBM job scheduler entries
▪ Directory entries
▪ Tape media library support

17 Switchable IASPs
▪ Single copy of data is maintained
▪ Data in Independent ASP (IASP) switched to backup system during outage

18 Switchable IASPs
▪ Independent Auxiliary Storage Pools (IASPs)
– OS/400 Option 41 - High Availability Switchable Resources
– Switch disks from one system to another
▪ Benefits:
– Simplicity
– Data is always current (no copy to synchronize)
– No in-flight data to lose
– Minimal performance overhead
– Supports integrated environments (Windows, Linux) as well as i5/OS
▪ Considerations:
– Setup of DASD configuration, data, and application structure
– Single copy of data (mirroring recommended to protect data and reduce single points of failure, SPOFs)
– No concurrent access from both hosts
– Hardware restrictions (distance, concurrent maintenance)
– Database restrictions on cross-IASP relationships

19 Cross-Site Mirroring (XSM) with Geographic Mirroring
▪ Second copy of data in an IASP is generated logically identical to first
▪ Changes to production IASP replicated to second copy of IASP through another system
▪ Operating system storage management based replication solution

20 Cross-Site Mirroring (XSM) with Geographic Mirroring
▪ Mirroring of IASP data via OS/400 storage management to a second server
– Included in Option 41 of OS/400 V5R3
– Enables switching or automatic failover to mirrored copy of IASP
▪ Benefits:
– Same as switched device
– Two copies of IASP data
– Can be local or remote
– Ease of deployment and operation
– Supports integrated environments (Windows, Linux) as well as i5/OS
▪ Considerations:
– Performance impacts of synchronous operation, distance, bandwidth, latency
– Mirror copy cannot be concurrently accessed
– Impractical to detach mirrored copy to do backups to tape
– Full data re-synchronization required for any persistent transmission interruption
– A configuration of at least three systems is recommended

21 TotalStorage PPRC with iSeries Copy Services toolkit
▪ Second copy of data is generated physically identical to the first
▪ TotalStorage peer to peer remote copy (PPRC) function combined with IASP
▪ Toolkit provides automation and reliable, foolproof operation
▪ Storage server based replication solution

22 TotalStorage PPRC with iSeries Copy Services toolkit
▪ Replication of IASP data at storage controller level to second ESS or DS using PPRC
– PPRC generates a second copy of the IASP on another TotalStorage server
– Toolkit is part of the iSeries Copy Services for IBM TotalStorage offering
– Combines PPRC, IASP, and OS/400 cluster services for coordinated switchover/failover
▪ Benefits:
– Remote copy and coordinated switching without an IPL
– Can combine with FlashCopy for backup window reduction
▪ Considerations:
– Performance impacts of synchronous mode: distance, bandwidth, latency
– Mirror copy cannot be concurrently accessed
– Asynchronous mode requires IBM TotalStorage Global Mirror
– Requires tools and services to deploy

23 TotalStorage PPRC with SAN Load Source
▪ Second copy of data is generated physically identical to the first
▪ TotalStorage peer to peer remote copy (PPRC) function combined with Boot from SAN
▪ All data, including load source, is replicated to second external storage server
▪ Storage server based replication solution

24 TotalStorage PPRC with SAN Load Source
▪ Replication of data at storage controller level to second storage server using PPRC
– PPRC generates a second copy of all data on another TotalStorage server
– Combines PPRC and boot from SAN for disaster recovery
▪ Benefits:
– Generates a complete copy of the system for DR
– Requires no changes to applications
▪ Considerations:
– Failover requires manual intervention and careful recovery
– All changed data, including temporaries, is replicated, giving high transmission volumes between TotalStorage servers
– Only for DR (do not consider if HA is needed)
– Performance and distance implications with synchronous mode
– Asynchronous mode requires IBM TotalStorage Global Mirror

25 Data Resilience Technologies: Applicability and Comparisons

26 Key Comparison Characteristics
1. Primary use
2. Characteristic of Replication Mechanism
3. Recovery Time
4. Recovery Point
5. Ordering of changes
6. Concurrent access
7. Geographic dispersion
8. Number of Backup systems
9. Number of Data copies allowed
10. Cost Factors
11. End User
12. Outage coverage
13. Cluster controlled resource
14. Risks
Consider other decision factors

27 Data Resilience Technologies: Applicability of Solution to Problem Set
Start to determine possible matches of technologies to specific needs:
1. Initial analysis to eliminate technologies that do not fit
2. After initial analysis, perform detailed analysis of complete requirement sets against specific characteristics of each technology
[Table: applicability by Business Continuity Requirement. Columns: Logical replication; Switched disk; XSM; PPRC with Copy Services toolkit; PPRC with SAN Load Source. Rows: Backup Window Reduction; Planned Maintenance; Recovery for disaster outage; HA for unplanned outage; Workload Balancing. Each row contains one n/a entry.]
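The initial elimination step can be sketched as a simple filter: keep only the technologies whose outage coverage includes every outage type you need to address. The Python sketch below takes its data from the "Outage coverage" row of the comparison table later in the deck. The assignment of that row's values to individual technologies is an assumption (the row lists four values for five technologies), and the function name is illustrative.

```python
# Outage types covered by each technology, read from the "Outage coverage"
# row of the comparison table. Assumption: XSM and PPRC with the Copy
# Services toolkit share the "planned, unplanned, disaster" entry.
OUTAGE_COVERAGE = {
    "Logical replication": {"planned", "unplanned", "disaster", "save window"},
    "Switchable IASPs": {"planned", "unplanned"},
    "XSM with Geographic Mirroring": {"planned", "unplanned", "disaster"},
    "PPRC with Copy Services toolkit": {"planned", "unplanned", "disaster"},
    "PPRC with SAN Load Source": {"disaster"},
}


def candidates(required_outage_types):
    """Return the technologies that cover every required outage type."""
    required = set(required_outage_types)
    return [name for name, covered in OUTAGE_COVERAGE.items()
            if required <= covered]


if __name__ == "__main__":
    # Example: HA for planned and unplanned outages plus disaster recovery.
    for name in candidates({"planned", "unplanned", "disaster"}):
        print(name)
```

A filter like this only rules technologies out. The surviving candidates still need the detailed analysis against recovery time, recovery point, cost and the other characteristics.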

28 Comparison of Logical Replication, Switchable IASPs, XSM w/ Geographic Mirroring, TotalStorage PPRC w/ IASP & iTC Toolkit, and PPRC with SAN Load Source

Primary use
– Logical Replication: HA (including DR)
– Switchable IASPs: HA (no DR)
– XSM w/ Geographic Mirroring; TotalStorage PPRC w/ IASP & iTC Toolkit: HA (including DR)
– PPRC with SAN Load Source: DR

Characteristic of replication mechanism
– Logical Replication: Object based replication; changes at record or object level based on data & audit journal. Logical copy of object level changes for selected objects.
– Switchable IASPs: No replication; 1 copy of data that is switchable between systems.
– XSM w/ Geographic Mirroring: Page level replication as controlled by operating system based on storage management writes. Logical copy since physical DASD configs can differ.
– TotalStorage PPRC w/ IASP & iTC Toolkit: Sector level replication of all pages written to disk. Physical copy of an IASP based on disk I/O (cache based).
– PPRC with SAN Load Source: Sector level replication of all pages written to disk. Physical copy of entire system based on disk I/O (cache based).

Recovery time considerations
– Logical Replication: Apply lag + replication switchover overhead. Journal settings. No IPL required. Minutes.
– Switchable IASPs: IASP vary on. SMAPP / journal settings. No IPL required. Minutes.
– XSM w/ Geographic Mirroring: IASP vary on. SMAPP / journal settings. No IPL required. Minutes.
– TotalStorage PPRC w/ IASP & iTC Toolkit: IASP vary on. SMAPP / journal settings. No IPL required. Minutes.
– PPRC with SAN Load Source: System shutdown and IPL time. Manual steps. SMAPP / journal settings. IPL required before use of backup. More than 1 hour.

Recovery point considerations
– Logical Replication: Transaction boundary with commitment control. Mixed: audit and data journal. Data / objects sent to target will be recovered. Changes not transmitted are lost (zero data loss with synchronous remote journal).
– Switchable IASPs: Transaction boundary with commitment control. Last data written to IASP. Objects not in IASP.
– XSM w/ Geographic Mirroring: Transaction boundary with commitment control. Last data written to IASP. Objects not in IASP.
– TotalStorage PPRC w/ IASP & iTC Toolkit: Transaction boundary with commitment control. Last data written to IASP. Objects not in IASP.
– PPRC with SAN Load Source: System shutdown to force changes to disk. Transaction boundary with commitment control. Last data written to disk.

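29 Comparison of Logical Replication, Switchable IASPs, XSM w/ Geographic Mirroring, TotalStorage PPRC w/ IASP & iTC Toolkit, and PPRC with SAN Load Source (continued)

Ordering of changes
– Logical Replication: Based on journal receiver content & HABP ability to synchronize changes from data & audit journals.
– Switchable IASPs: Ordering preserved.
– XSM w/ Geographic Mirroring: Ordering at system level. Ordering preserved across ASP group.
– TotalStorage PPRC w/ IASP & iTC Toolkit: Ordering at controller level. Preserved at LUN (disk) level for Metro Mirror. Consistency groups for Global Mirror.
– PPRC with SAN Load Source: Ordering at controller level. Preserved at LUN (disk) level for sync PPRC. Consistency groups for Global Mirror.

Concurrent access
– Logical Replication: Typically read only, possibly shared data. Always some lag time in data currency. Remote Journal helps.
– Switchable IASPs: No concurrent access since no copy of data.
– XSM w/ Geographic Mirroring: No; requires resynchronization. Second copy current.
– TotalStorage PPRC w/ IASP & iTC Toolkit: No concurrent access. Copy current with Metro Mirror; consistency groups define lag time for Global Mirror.
– PPRC with SAN Load Source: No concurrent access. Copy current with Metro Mirror; consistency groups define lag time for Global Mirror.

Geographic dispersion
– Logical Replication: Virtually unlimited
– Switchable IASPs: Limited (250 m)
– XSM w/ Geographic Mirroring; TotalStorage PPRC w/ IASP & iTC Toolkit; PPRC with SAN Load Source: Virtually unlimited

Number of backup systems
– Logical Replication: 1 <= n < 127 (or BP max)
– Switchable IASPs: n = 1 (with switchable towers)
– XSM w/ Geographic Mirroring: 1 <= n <= 3 (2 or 3 with switchable towers)
– TotalStorage PPRC w/ IASP & iTC Toolkit: 1 <= n <= 2 (2 with cascading PPRC)
– PPRC with SAN Load Source: 1 <= n <= 2 (2 with cascading PPRC)

Number of data copies
– Logical Replication: 127 (or BP max)
– Switchable IASPs: none
– XSM w/ Geographic Mirroring: 1
– TotalStorage PPRC w/ IASP & iTC Toolkit: 2
– PPRC with SAN Load Source: 2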

30 Comparison of Logical Replication, Switchable IASPs, XSM w/ Geographic Mirroring, TotalStorage PPRC w/ IASP & iTC Toolkit, and PPRC with SAN Load Source (continued)

Cost factors
– Logical Replication: Any DASD configuration. HABP software. Bandwidth. Duplicate disks.
– Switchable IASPs: Switchable tower (or IOP). i5/OS Option 41.
– XSM w/ Geographic Mirroring: Any (flexible) DASD configuration. i5/OS Option 41. Bandwidth. Duplicate disks.
– TotalStorage PPRC w/ IASP & iTC Toolkit: External DASD (2 ESS or DS). PPRC. Bandwidth. i5/OS Option 41. Toolkit. Duplicate disks.
– PPRC with SAN Load Source: External DASD (2 ESS or DS). PPRC. Bandwidth. Duplicate disks. Idle backup system. High volume of sector replication.

End user disruption
– Logical Replication: Replication overhead. Can automatically restart application.
– XSM w/ Geographic Mirroring: Geographic mirroring overhead. Can automatically restart application.
– TotalStorage PPRC w/ IASP & iTC Toolkit: PPRC overhead. Can automatically restart application.
– PPRC with SAN Load Source: PPRC overhead. System shutdown. Manual application restart.

Outage coverage
– Logical Replication: Planned, unplanned, disaster, save window
– Switchable IASPs: Planned, unplanned
– XSM w/ Geographic Mirroring; TotalStorage PPRC w/ IASP & iTC Toolkit: Planned, unplanned, disaster
– PPRC with SAN Load Source: Disaster

Cluster control
– Logical Replication: Yes
– Switchable IASPs; XSM w/ Geographic Mirroring; TotalStorage PPRC w/ IASP & iTC Toolkit: Yes, of switchable devices
– PPRC with SAN Load Source: No, manual only

Risks
– Logical Replication: Loss of in-flight data. Mismatch of data levels for various objects. Monitoring logical object replication environment.
– Switchable IASPs: Disk subsystem is single point of failure, therefore no protection against catastrophic disk failure.
– XSM w/ Geographic Mirroring: Asynch case: can lose copy on double failure if cannot quiesce & vary-off. Resynch after detach may yield lengthy unprotected condition.
– TotalStorage PPRC w/ IASP & iTC Toolkit: Somewhat complex. Asynch PPRC via Global Mirror. Disk protection provided by TotalStorage.
– PPRC with SAN Load Source: Systems cannot be concurrently active. Complex. Asynch PPRC via Global Mirror. Long IPL.

31 Conclusions
▪ When to consider Logical Replication?
– Need two or more copies of the data
– Want some level of concurrent access to second data copy
– Need backup window reduction
– Monitoring the state of the replication environment can be done by your IT staff
– Geographic dispersion between copies is needed
– Already have solution deployed using logical object replication
– Need a solution that has no special hardware configuration requirements
– Failover / switchover times should not exceed 10's of minutes
– Transaction level integrity is important for all journaled objects
▪ When to consider Switchable IASPs?
– Single copy of data meets requirements; addressed exposure to disk subsystem failures
– Need a very simple, low cost, low maintenance solution
– No need for DR solution
– Source and target system will be at the same site
– Want consistent fail/switchover times within minutes independent of transaction volumes
– Need transaction level integrity for all objects; no loss of in-flight data
– Need highest throughput environment
– Need multiple, independent databases that can be moved between systems

32 Conclusions (2)
▪ When to consider Cross-Site Mirroring?
– Want a system-generated second copy of the data (at an IASP level)
– Need two copies of data, but do not need concurrent access to second copy
– Want relatively low cost, low maintenance solution, but also need disaster recovery
– Need geographic dispersion between copies; distance does not impact performance goals
– Want consistent fail/switchover times within minutes independent of transaction volumes
– Need transaction level integrity for all objects; no loss of in-flight data
▪ When to consider TotalStorage PPRC with IASP and Toolkit?
– Want storage based solution for HA; especially if multiple platforms are involved
– Want consistent fail/switchover times within minutes independent of transaction volumes
– Need two copies of data, but do not need concurrent access to second copy
– Need geographic dispersion between copies; distance does not impact performance goals
– Need transaction level integrity for all objects; no loss of in-flight data
▪ When to consider TotalStorage PPRC with Fiber Channel Load Source?
– Want storage based solution for DR only; especially if multiple platforms are involved. Do not need HA.
– Manual recovery and recovery time of hours is acceptable
– Need two copies of data, but do not need concurrent access to second copy
– Need geographic dispersion between copies; distance does not impact performance goals
– Need transaction level integrity for all objects; no loss of in-flight data

33 Conclusions (3)
▪ When to consider a combination solution?
– When no single solution meets all of your business continuity requirements

34 Summary:
Data Replication alone is not sufficient for HA. Clustering, automation and application resiliency complete the equation.
▪ Clustering PLUS
▪ Data replication PLUS
▪ Cluster enabled replication
▪ EQUALS "real" HA
[Diagram layers:]
– Application Resiliency: High availability cluster enabled applications
– Cluster Management: iSeries Navigator or partner products
– Cluster Resource Services: Base OS/400 - i5/OS cluster functions from IBM APIs
– Data Resiliency: Replication and Switched IASPs
– OS/400 - i5/OS Option 41 – HA Switchable Resources
[Diagram callouts:]
▪ Heart beating
▪ IP Address Takeover
▪ Reliable internal cluster communications
▪ Switchover administration
▪ Distributed activities

