1 Achieving High Availability in a SAS® Grid Environment
Daniel Wong, Platform Computing Inc.
Copyright © 2008 SAS Institute Inc. All rights reserved. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

2 Agenda
- Background
- Introduction to SAS Grid and HA
- Techniques for achieving HA
- SAS Metadata Server example
- Future work

3 Achieving High Availability in a SAS Grid Environment
Target audience:
- IT managers and administrators who want to improve overall availability in their SAS Grid environment
- Consultants who help customers implement SAS Grid
- Anyone evaluating Grid who wants more general knowledge about HA in the SAS Grid environment
- …

4 Platform Computing
- 4,000,000 managed CPUs
- 2,000 customers worldwide
- 500 employees in 15 offices
- 16 years of profitable growth
- 8 years of partnership with SAS
- 1 Leader in HPC

5 SAS-Platform OEM Partnership
- Started with Job Scheduler for SAS Warehouse Administrator v8 in 2001
- Current OEM product: Platform Suite for SAS
- Platform provides the Grid middleware technologies to SAS for:
  - Scheduling of SAS workflows
  - Effective management of resources and SAS workload
  - Scaling out SAS (parallelized) applications (DI, EM, Risk Dim…)
- The SAS and Platform R&D teams continually drive new development initiatives to add value for SAS customers

6 Status of this work
Goal of the joint development initiative:
- To provide failover capabilities for critical SAS system services on all operating systems supported by SAS Grid Manager
Current status:
- Working prototype that can fail over SAS Metadata Server in both Windows and Linux SAS Grid environments
Key advantages of this solution:
- Eliminates operating system dependencies
- Eliminates the requirement of a hot-standby for failover
- Eliminates the expense of purchasing a third-party tool to provide high availability

7 Agenda
- Background
- Introduction to SAS Grid and HA
- Techniques for achieving HA
- SAS Metadata Server example
- Future work

8 SAS Grid 101
Key value proposition / use cases:
- Enterprise Scheduling
- Workload Balancing
- Parallelized Workload Balancing

9 SAS Enterprise Scheduling

10 SAS Workload Balancing

11 Parallelized Workload Balancing

12 SAS Grid Architecture Topology

13 4 key steps to implement HA
1. Monitor: discover service failures
   Capabilities required:
   - Detect failure of individual services
   - Monitor services on all hosts in the Grid
2. Recover: restart the service when failures are detected
   Capabilities required:
   - Capability to restart services on any host or on designated hosts in the Grid
3. Synchronize: ensure client applications and Grid components are aware of the new configuration
   Capabilities required:
   - Support for services running on a host with a different IP address
   - Client apps and Grid components must know about the new arrangement
4. Reconnect: enable client applications to access the service again
   Capabilities required:
   - Client apps must be able to access the service after failover without having to reconfigure
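The four steps above can be sketched as a toy in-memory model. This is only an illustration of the cycle, not how EGO implements it; the `Grid` class, the function names, and the host names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Grid:
    """Toy model of a grid; every name here is hypothetical."""
    hosts_up: dict                      # host name -> True if the host is alive
    service_host: str                   # host currently running the service
    name_table: dict = field(default_factory=dict)  # service name -> host


def monitor(grid):
    """Step 1: report whether the service's current host has failed."""
    return not grid.hosts_up.get(grid.service_host, False)


def recover(grid, candidates):
    """Step 2: restart the service on the first live candidate host."""
    for host in candidates:
        if grid.hosts_up.get(host):
            grid.service_host = host
            return host
    raise RuntimeError("no live failover host")


def synchronize(grid, name):
    """Step 3: publish the new location under the same service name."""
    grid.name_table[name] = grid.service_host


def reconnect(grid, name):
    """Step 4: clients resolve the unchanged name to the new host."""
    return grid.name_table[name]


grid = Grid(hosts_up={"hostA": True, "hostB": True}, service_host="hostA")
synchronize(grid, "SASMeta")
grid.hosts_up["hostA"] = False          # simulate a crash of hostA
if monitor(grid):
    recover(grid, ["hostA", "hostB"])
    synchronize(grid, "SASMeta")
current = reconnect(grid, "SASMeta")
```

The point of the sketch is that the client only ever uses the service name, so steps 3 and 4 need no client-side reconfiguration.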

14 Agenda
- Background
- Introduction to SAS Grid and HA
- Techniques for achieving HA
- SAS Metadata Server example
- Future work

15 Platform Enterprise Grid Orchestrator (EGO)
- EGO is included in Platform Suite for SAS v4.1 / SAS Grid Manager 9.2
- Available on all Unix, Linux, and Windows platforms supported by Grid Manager
- EGO provides a full suite of services that:
  - Allows users to treat distributed software and hardware resources as components of a virtual computer
  - Enables all applications, services, and workload to access a shared infrastructure
  - Allocates resources according to policy, simplifies management, and improves availability of the entire environment
- We will focus on two EGO functions:
  - EGO Service Controller
  - EGO Service Director

16 Monitoring and Restarting Services with EGOSC
EGO Service Controller (EGOSC):
- Starts on the Grid Control Machine
- Is responsible for starting services on remote grid nodes
- Ensures services are running by detecting failures and restarting service instances based on configuration
Registering your service with EGO:
- Create an EGO Service Definition File
  - Service Name: the name of the service, a handle for EGO
  - EGOCommand: the exact command to start your service
  - HostFailoverInterval: threshold for EGO to trigger failover
  - ResourceRequirement: host candidates for the service
  - ExecutionUser: user ID that runs the service
- EGO will start, monitor, and restart your service according to these definitions
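As a rough illustration of what the service definition carries, the five fields named on this slide can be modeled as a small record with a sanity check. This is not the EGO Service Definition File syntax; the class, the units of the interval, the command path, and the user name below are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDefinition:
    """Hypothetical model of the fields listed on the slide."""
    service_name: str                # handle EGO uses for the service
    ego_command: str                 # exact command that starts the service
    host_failover_interval_s: int    # assumed to be seconds for this sketch
    resource_requirement: tuple      # candidate hosts for the service
    execution_user: str              # user ID that runs the service

    def validate(self):
        """Return a list of problems; an empty list means the record is usable."""
        problems = []
        if not self.service_name:
            problems.append("service name is empty")
        if not self.ego_command:
            problems.append("start command is empty")
        if self.host_failover_interval_s <= 0:
            problems.append("failover interval must be positive")
        if not self.resource_requirement:
            problems.append("no candidate hosts")
        if not self.execution_user:
            problems.append("execution user is empty")
        return problems


# Placeholder values: the path and the user name are invented for illustration.
sasmeta = ServiceDefinition(
    service_name="SASMeta",
    ego_command="/path/to/start_metadata_server",
    host_failover_interval_s=60,
    resource_requirement=("hostA", "hostB"),
    execution_user="sasuser",
)
problems = sasmeta.validate()
```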

17 Location independence for failover
Consider the following scenario:
1. Metadata Server (SASMeta) is configured to be started by EGO
2. Client App can look up Metadata Server by:
   $nslookup $server SASMeta.ego.sas.com
   Server: (Corporate DNS server for client)
   Address: #53
   Name: SASMeta.ego.sas.com
   Address:
3. Metadata Server goes down unexpectedly as hostA crashes
4. EGO detects that Metadata Server is gone and restarts it on a failover host
5. Metadata Server is successfully restarted:
   $nslookup $server SASMeta.ego.sas.com
   Server: (Corporate DNS server for client)
   Address: #53
   Name: SASMeta.ego.sas.com
   Address:
6. Recovery is completed. Client App can still access Metadata Server by connecting to SASMeta.ego.sas.com
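From the client's side, the scenario above amounts to resolving the service name again on each connection attempt instead of caching an IP address. A minimal sketch, assuming a plain TCP client; the function name, retry counts, and the injectable `resolver` and `connector` parameters are inventions for this example, and the port in the usage comment is only a placeholder.

```python
import socket
import time


def connect_by_name(name, port, attempts=3, delay_s=0.0, timeout_s=5.0,
                    resolver=socket.gethostbyname,
                    connector=socket.create_connection):
    """Connect to a service by DNS name, re-resolving the name on every attempt.

    Because the name is looked up again each time, a client picks up the new
    address once the name service has been updated after a failover.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            address = resolver(name)            # fresh lookup, no cached IP
            return connector((address, port), timeout_s)
        except OSError as exc:                  # covers lookup and connect errors
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay_s)
    raise ConnectionError(f"could not reach {name}:{port}") from last_error


# Example use (placeholder port; not executed here):
# sock = connect_by_name("SASMeta.ego.sas.com", 8561, attempts=10, delay_s=30)
```

This only helps if nothing between the client and the name server holds on to the old answer, which is why the configuration steps later in the deck disable the DNS cache on the client machine.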

18 EGO Service Director
- Provides a location mechanism for other system services
- Contains a stand-alone Domain Name Server for the EGO DNS sub-domain
- Provides server name to IP address resolution within the sub-domain
- Relies on Service Controller to provide location info for service instances
- Service Director updates the Corporate DNS when the location of the Service Director DNS changes

19 Normal Case
Diagram labels: Client; Corporate DNS Server; Grid Control Machine; EGO Kernel; Service Controller; Service Director; Service Director DNS Server; Grid Nodes (Host B, Host C, … Host n); Service A
Diagram arrows: Launch/Control/Monitor; Update location of service instances; Update location of SD DNS Server
1. Client makes a DNS query to determine the location of Service A
2. Corporate DNS forwards the query to SD DNS
3. SD DNS responds with the latest location of the service
4. Corporate DNS passes the response back to Client
5. Client uses the response to locate the service
Callout: "Don't know anything about the EGO sub-domain, let's pass it on"

20 Failover Case
Diagram labels: Client; Corporate DNS Server; Grid Control Machine; EGO Kernel; Service Controller; Service Director; Service Director DNS Server; Grid Nodes (Host B, Host C, … Host n); Service A
Diagram arrows: Launch/Control/Monitor; Update location of service instances; Update location of SD DNS Server
1. EGO discovers that Service A has failed
2. Service Controller restarts Service A on failover Host C
3. Service Controller updates the Service Director DNS Server
4. Client obtains the new name-to-IP address mapping to access Service A on Host C
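The normal and failover cases can be tied together with a toy model of the two name servers. It illustrates the forwarding idea only and is not real DNS; the class names are hypothetical and the addresses are documentation-range placeholders, not values from the slides.

```python
class ServiceDirectorDNS:
    """Toy stand-in for the Service Director DNS (hypothetical class)."""

    def __init__(self):
        self.records = {}

    def update(self, name, address):
        # Called when the Service Controller reports a (new) service location.
        self.records[name] = address

    def query(self, name):
        return self.records[name]


class CorporateDNS:
    """Toy corporate DNS that forwards queries for one sub-domain."""

    def __init__(self, subdomain, forwarder, own_records=None):
        self.subdomain = subdomain
        self.forwarder = forwarder
        self.own_records = own_records or {}

    def query(self, name):
        # Names inside the EGO sub-domain are passed on; others are local.
        if name.endswith("." + self.subdomain):
            return self.forwarder.query(name)
        return self.own_records[name]


sd_dns = ServiceDirectorDNS()
corp_dns = CorporateDNS("ego.sas.com", sd_dns)

# Normal case: the service runs on its default host.
sd_dns.update("SASMeta.ego.sas.com", "192.0.2.10")
before = corp_dns.query("SASMeta.ego.sas.com")

# Failover case: the service is restarted elsewhere and the record is updated.
sd_dns.update("SASMeta.ego.sas.com", "192.0.2.20")
after = corp_dns.query("SASMeta.ego.sas.com")
```

The client asks the same question of the same server in both cases; only the record held by the Service Director DNS changes.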

21 What's going to fail over EGO Service Controller?
- Answer: Load Index Manager (LIM), the first process responsible for starting all Grid Services
What's going to fail over LIM?
- LIM runs on each node under a master-slave arrangement
- Master and slave LIMs communicate host load index info regularly
- If the master LIM cannot be reached, the slave LIMs will determine a new master LIM among themselves
- The new master can restart EGO Service Controller
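The slide does not say how the slave LIMs agree on a new master. One simple rule, assumed here purely for illustration, is that every node holds the same ordered candidate list and the first reachable candidate wins. The function names and the heartbeat timeout are hypothetical.

```python
def master_is_reachable(last_heard_s, now_s, timeout_s):
    """A master counts as reachable if it was heard from within the timeout."""
    return (now_s - last_heard_s) <= timeout_s


def elect_master(candidates, reachable):
    """Pick the first host in the shared, ordered candidate list that is reachable.

    Assumed rule for this sketch: since every node evaluates the same list,
    all nodes that see the same set of reachable hosts pick the same master.
    """
    for host in candidates:
        if host in reachable:
            return host
    return None


candidates = ["hostA", "hostB", "hostC"]

# hostA was last heard from 90 s ago with a 60 s timeout, so it is considered gone.
if not master_is_reachable(last_heard_s=10.0, now_s=100.0, timeout_s=60.0):
    new_master = elect_master(candidates, reachable={"hostB", "hostC"})
```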

22 Agenda
- Background
- Introduction to SAS Grid and HA
- Techniques for achieving HA
- SAS Metadata Server example
- Future work

23 Example: HA for SAS Metadata Server
Steps shown on the slide: Install SAS Metadata Server; Install LSF; Configure EGO Services; Failover Tests; Eliminating Hot-Standby
Assumptions:
- hostA = default host for Metadata Server
- hostB = failover host for Metadata Server
- \\hostF\LSFShare = shared config dir accessible by all hosts in the Grid
- \\hostF\SAS = shared config dir for SAS Metadata Server, accessible by hostA and hostB
Install LSF:
1. Install Platform Suite for SAS 4.1 provided with SAS Grid Manager
2. hostA should be the Grid Control Machine
3. hostB should be the failover host for the Grid Control Machine (same as for Metadata Server)

24 Example: HA for SAS Metadata Server
Install SAS Metadata Server:
1. The user account running Metadata Server must have full access to \\hostF\SAS
2. Install Metadata Server on hostA
3. Set the configuration directory to \\hostF\SAS
4. Repeat #3 for hostB
Note: SAS and 9.2 install procedures are different

25 Example: HA for SAS Metadata Server
Configure EGO Services:
1. Create an EGO service definition file for Metadata Server
2. Specify hostA and hostB as host candidates
3. Enable Service Director to start automatically on hostA and fail over to hostB
4. Configure EGO Name Service (DNS): set domain names and the encryption key to CorpDNS
5. Configure the Corporate DNS to forward EGO sub-domain queries to the EGO DNS
6. Restart all EGO services
7. Disable the DNS cache on the client machine
8. Test with nslookup

26 Example: HA for SAS Metadata Server
Failover Tests:
1. Connect with SAS Management Console to check that Metadata Server is functioning on hostA
2. Shut down hostA
3. Check with egosh service list that Metadata Server restarts on hostB
4. Restart SAS MC; it should connect to Metadata Server again on hostB
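When running the test above, it can be useful to measure how long the service stays unreachable after hostA is shut down. A minimal sketch, assuming a TCP reachability check is an acceptable stand-in for "functioning"; the function names, timings, and the port in the usage comment are assumptions and not part of the slides.

```python
import socket
import time


def tcp_probe(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout_s):
            return True
    except OSError:
        return False


def wait_for_service(probe, timeout_s=300.0, interval_s=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it succeeds; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


# Example use after shutting down hostA (placeholder port; not executed here):
# downtime = wait_for_service(lambda: tcp_probe("SASMeta.ego.sas.com", 8561))
```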

27 Example: HA for SAS Metadata Server
Eliminating Hot-Standby:
- This step is optional
- When Metadata Server is running on hostA, hostB can be a Grid node running SAS jobs
- During failover, LSF can be reconfigured not to send jobs to hostB
- Use a wrapper script to close the host to new jobs before starting Metadata Server
- Jobs already running on hostB will continue to run until they finish, but no more jobs will be dispatched to hostB
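The wrapper script mentioned above could look roughly like the sketch below. It assumes the LSF `badmin hclose` command is used to close the host and that the wrapper has permission to run it; the function name and the start command path are placeholders, and the slides do not show the actual script.

```python
import subprocess


def close_host_then_start(host, start_command, run=subprocess.run):
    """Close an LSF host to new jobs, then start the service in the foreground.

    Jobs already running on the host are left alone; closing the host only
    stops new jobs from being dispatched to it.
    """
    close_command = ["badmin", "hclose", host]
    run(close_command, check=True)          # raises if the host cannot be closed
    service_command = list(start_command)
    run(service_command, check=True)        # blocks while the service runs
    return [close_command, service_command]


# Example use as the command EGO starts on the failover host
# (placeholder path; not executed here):
# close_host_then_start("hostB", ["/path/to/start_metadata_server"])
```

Reopening the host once Metadata Server moves back to hostA is a separate step that this sketch does not cover.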

28 Agenda
- Background
- Introduction to SAS Grid and HA
- Techniques for achieving HA
- SAS Metadata Server example
- Future work

29 Failover for Grid Nodes?
Diagram labels: LSF; Grid Nodes (Host 1, Host 2, … Host n); SAS JobA
Callout: Host 1 has crashed, requeue if JobA is rerunnable
- LSF can requeue a job when it detects that the host running the job has failed and the job is defined as rerunnable
- When the job is requeued, LSF will try to dispatch it to another suitable host
- A rerun job starts from the beginning, meaning that all previous computation and data results must be redone

30 SAS 9.2 Checkpoint with Requeuing
Diagram labels: LSF; Grid Nodes (Host 1, Host 2, … Host n); SAS JobA
Callout: Host 1 has crashed, requeue if JobA is rerunnable
Checkpoint images written to the shared directory while JobA was running on Host 1:
- JobA – chkpnt#1
- JobA – chkpnt#2
- …
- JobA – chkpnt#17
- …
Host 1 crashed! SAS JobA on Host 2 resumes from chkpnt#17
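The difference between rerunning from the beginning and resuming from a checkpoint can be shown with a generic sketch. This is not the SAS 9.2 checkpoint mechanism; the function name, the JSON file layout, and the idea of a job as a list of step functions are assumptions made for the example.

```python
import json
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_file):
    """Run steps in order, recording progress in a shared checkpoint file.

    If the file already exists (for example, after a requeue onto another
    host), execution resumes after the last recorded step instead of
    starting over. Returns the indexes of the steps executed in this run.
    """
    path = Path(checkpoint_file)
    completed = json.loads(path.read_text())["completed"] if path.exists() else 0
    executed = []
    for index in range(completed, len(steps)):
        steps[index]()                                   # may raise on a crash
        executed.append(index)
        path.write_text(json.dumps({"completed": index + 1}))
    return executed
```

For this to help after a host crash, the checkpoint file has to live on storage that the failover host can also reach, which matches the shared directory shown on the slide.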

31 Conclusion
- Working prototype that can fail over SAS Metadata Server in both Windows and Linux SAS Grid environments
Key advantages of this solution:
- Eliminates operating system dependencies
- Eliminates the requirement of a hot-standby for failover
- Eliminates the expense of purchasing a third-party tool to provide high availability
Future work:
- Provide HA configuration templates for the rest of the SAS services and platforms
- Simplify configurations
- Provide support for SAS 9.2 checkpoint and restart mode
