2 Co-authors Reviewers: Tan Wee Kiong, VCAP-DCD, VCAP-DCA, vExpert, VMware Fraser Hall, VCP, Solutions Architect, VMware Michael Webster, VCDX, Strategic Architect, VMware Heera Singh, VCAP-DCD, VCAP-DCA, UBS Bank Kenneth Chan, VCP, Principle Engineer, Trusted Source Benjamin Troch, VCAP-DCD, VMUG Singapore Leader, vExpert Ivan Chee, VCP, DBS Bank Amitabh Dey, VCP, vExpert Iwan Rahabok Staff SE, VMware | Linkedin.com/in/e1ang VCAP-DCD, TOGAF Certified, vExpert Arseny Chernov Enterprise Architect, FSI, Dell | Linkedin.com/in/arsenychernov VCAP-DCD
3 MAS: TRM Guidelines Published on June 2013 Combines previous documents into 1: IT outsourcing Endpoint security and data protection, Information systems reliability, Resiliency and recoverability Focuses on availability & security & performance Covers the risk only, not the benefit. Technology can lower risk & improve security compared to pen and paper. Usage of certain capability of technology should in fact be encouraged. Cars have brake so we can drive faster. Security and Availability should not slow down the speed of business. Written for FI (Financial Institution) Both global FI and local FI, which operates in Singapore, follow this closely. MAS conducts regular audit. A challenging period for the FI being audited as it’s difficult to fully comply to this while being cost effective. This is where virtualisation can help. Virtualisation enables FI to comply better while lowering the cost of business.
4 Content’s Analysis Analysis (from Virtualisation’s centric view) Does not mention virtualisation at all. Puzzling as virtualisation is the single largest disruption to IT architecture and operation. Classic physical world architecture & operation. By itself, physical separation is not true security. Virtualisation also changes the risk as physical separation is removed. Thinking is influenced by TOGAF and ITIL. Process and Committee oriented. No Agile and rapid innovation. Rather high level; subject to interpretation. Resulted in inconsistent interpretation by FI. FI tend to think Active/Active datacenter is required.
5 Content’s Analysis Overall Analysis: Trust no employee Sys Admin must be tracked. Guard for internal attack Internal sabotage is considered a real risk. More dangerous than external attack due to intimate knowledge Broad stroke conservative. All social media sites, cloud-based storage, web-based s are classified as “unsafe internet services”. No technical fact given to support why they they are all insecure. DR, not DA, is a must DA is not even mentioned. Guard the end point Smart phone, tablet, laptop IT audit must be independent In some FI, they are no longer reporting to CIO.
6 Detailed Analysis DR and Availability Network & Security Application People and Process End User Computing Others Additional Explanation is provided in the speaker notes
7 DR: Clauses
8 Worst case is not just when entire datacenter goes down. It could also be no expertise around (e.g. on block leave or sick leave) An outdated Documentation can be more risky than no documentation. An wrong procedure can cause serious outage. Manually updating the DR server is not “replication”. Replication implies system.
9 DR: Analysis The changes must be replicated & migrated, not manually implemented in DR Site In the physical world, the DR server has a different host name and IP, and is a separate, independent instance. Production is not equal DR. Changes must be manually implemented. Virtualisation solves this by replicating the entire server, not just the data drive. Replication can be done at the hypervisor-layer and IP. Not limited to SAN and FC. Replication can also be rolled back to previous snapshot A Test shall not require Production to be down. If a test requires production to be down, then it is a flaw in the principle. Classic DR architecture needs Production to be down as it can’t isolate the network. A VM is much easier to control than a physical machine. A Test should be fully automated and assume the worst case scenario You cannot assume all the technical people are available. It must be executed with minimal IT knowledge. Most FI do not have ability to test full-datacenter scenario as it’s too manual and complex in the physical world DR solution that relies on MS Excel or MS Word gets outdated easily. Manual DR results in “adhoc” as it’s impossible to guarantee human will execute to the dot With Site Recovery Manager (SRM), the entire run book is no longer manually maintained. There is no more “procedures” to be “evaluated and updated” as the run book is no longer a separate, detached entity. It is defining the recovery plan. The plan and the documentation are one, and there is no longer a need to manually ensure they are in-sync. SRM provides built-in audit as it records every steps in the run book. Hence it also serves as the repository of tests history. Every steps is recorded, and not typed manually in MS Word. Manual solution such as MS Word means we cannot prove that what actually happened is what is documented/reported. If we do regular & frequent Failover, that requires Failback to be automated too. Annual test is too infrequent due to too many changes Changes in users (new staff, new roles) and IT personnel. Changes in applications and infrastructure Application dependancy can be automatically discovered and shown by the hypervisor The dependancy can be integrated with the DR plan. There is no need to deploy an agent as it’s provided by the hypervisor
10 Storage availability DAS, NAS, SAN. What if we have a new architecture? This requirement comes out because there is only 1 central array. What if we have many, just like Server? Merely bringing up the Storage subsystem alone is not suffcient. Server, Network, Security, Management must be up too.
11 Storage availability Storage virtualisation addresses the physical array being a SPOF Instead of 1 monolithic, complex storage, the storage systems is distributed. Every ESXi node become a storage subsystem too. It becomes harder to have a major outage as the storage is now distributed. Not just the storage is distributed, the data can be duplicated too. A critical VM might need to have quad- redundancy and its data is duplicated 4x (so it exists in 4 different ESXi host) Storage vMotion allows data to be migrated live from 1 physical array to another. With a 10 GE connection, the data migration can be completed relatively quick. This replication is independent of underlying technology. It can replicate from FC array to iSCSI array, or from VMFS to NFS Having the data replicated is not enough Simply having the storage subsystem replicated does not mean the entire system is serving the users. Access to this data (storage) is via Server, and access to this Server is via Network, and this Network needs to have Security, and the entire System needs to be Managed. So we need to replicate entire datacenter, not just Storage. Synchronous replication means “bad” data is replicated right away. No reaction time from IT. Replication should enable multiple snapshot. This allows IT to roll back to point in time.
12 RTO is practically a lot less than 4 hours RTO starts the moment the system is down, not the time CIO approves DR activation. Basing on cloud 8.2.4, it starts “from the point of disruption”. So if critical service gets disrupted at 3 am, and CIO declares disaster at 6 am, there is only 1 hour left to bring up everything. It’s also an “application level” definition, not infrastructure level. So if a multi-tier apps needs 1 hour after OS is up, then it has to be considered. Getting just the OS up and running is not good enough.
13 RTO is practically a lot less than 4 hours The generally industry-accepted definition of RTO is shown below MAS definition of RTO is actually MTO. In reality, especially for multi-tier applications, you may only have minutes, not 4 hours. Annual exercise is way too long to be able to react within that 4 hours MTO. People forget, procedures change. Manual procedure will not be able meet 4 hours MTO The entire DR response has to be automated. The only thing manual is the decision.
14 Active/Active datacenter, or A/A application? A lot of FI are implementing Active/Active Datacenter In Active/Active datacenter, both arrays are serving Production workload. This requires very careful operation. They become 1 logical datacenter, with 1 network spanning across both physical sites. A failure could bring down both datacenters as it can propagate from 1 DC to another. This is especially on the part that is connected (e.g. stretched network, replicated storage, stretched cluster or single vCenter) Active/Active datacenter is not a DR solution. Stretched Cluster is often mistaken as a DR solution. It is a DA solution. For a technical explanation, refers to Consider a Hybrid Datacenter instead Site 2 is running Production workload on a separate storage, which can be patched independantly. This allows FI to patch the non-production array to see the impact first. Virtualisation makes going beyond 2 hardware boxes easier In physical world, HA is typically achieved by having 2 hardwares (acting as a pair). This protects when 1 box is down. But it can not withstand 2 failures. In the virtual world, we can withstand more than 2 node failures, both at the compute node and storage node.
15 Complex inter-dependancy Network virtualisation can help removing the SPOF of physical network While the physical network has redundant hardwares and path, it is still 1 network. Stretching the network to another datacenter actually extends the risk, as a network outage or misconfiguration can bring down both datacenter as they are 1 network. Network virtualisation is an emerging technology in It decouples the system’s network from the underlying datacenter network. Changing the datacenter network will be like changing the hypervisor; it will have no impact to the VM or network above it. Great explanation by Ivan Pepelnjak on why FI should think deeper before implementing a stretched network, creating 1 logical Datacenter from 2 physical DC. blog.ioshints.info/2012/10/if-something-can-fail-it-will.html blog.ioshints.info/2011/06/stretched-clusters-almost-as-good-as.html “Interconnected things tend to fail at the same time.”
16 Security protection Need to protect against Internal attack. So having IDS/IPS and FW are edge is non compliant.
17 Security Hypervisor-based protection is superior to agent-based or network-based protection The hypervisor can see everything, but it cannot be seen by the attacker/malware It makes security an integral component. Adding a compute node means adding a security node. The attacker will not be able to remove the agent as it is agentless. The AV is not visible to the malware or rogue sys admin. A sys admin has local privilege to the Guest OS and can uninstall or alter any settings, so having an agent is not a tamper proof solution. It ensures that AV is deployed. Instead of deploying for 1000 servers and 10,000 desktops, we just need to deploy per ESXi host. It is easier to manage. A typical VDI has 80:1 consolidation ratio, cutting the amount of nodes to manage by 8000% It has less impact on performance. Running a full scan on Windows can render the OS practically useless for other purpose. For Firewall, it is not limited to TCP/IP. Hence, the rules are also more aligned with IT policy, and easier to understand and review. For example, a rule can state that all virtual desktops, regardless of IP address or applications, in the contractor folder cannot access Internet. This results in both simpler firewall rules, and a lot less rules. Physical FW relies on TCP/IP. If the server is compromised and IP address is changed, the firewall rule will no longer apply. Also, it has many complex rules, making it difficult to understand the complete rules. It is hence easy to make mistake. As the hypervisor provides security services, IDS/IPS can be deployed on every juncture, not just limited to “critical juncture”. It can quarantine a new server until it is approved, as it controls the network connection. It is easier on the network & storage. Updating 1000 Windows can take serious toll on the physical resource. Security by Physical separation is good, but… It can’t be the only security solution. Physical separation is in fact not a solution as systems are now interconnected in a complex network. While it is physically separated, they are still connected via network. The security must remains in place when the physical separation no longer exist (e.g. connection is bridged accidentally). The use of VLAN in networking means there is no physical isolation. Isolation can be fully achieved even in the same cluster. Separating the cluster can in fact gives a false assurance that there is security if the physical isolation is the only “solution” in place. With multi-tenancy (even within 1 FI), a new set of tools have to be deployed. These tools are built for virtualisation. Penetration Test for virtual world is different There are now 2 distinct layer (Consumer and Supplier) as virtualisation decouples a physical machine into 2 layers. The penetration test at the VM layer does not change. The penetration test at the vSphere layer change.
18 Virtualisation can retain the separation It is cost prohibitive to fully comply with the TRM Guideline as it requires a lot of “separation” and “segregation”. Virtualisation lowers the cost. But needs to take into account Operational discipline and guards againts human error. The reason is it is much easier to make changes in virtual environment.
19 VMware Security Advisories Cover risks identified through the breadth of customer install base, as long as through support logs analysis, and provides detailed impact and remediation information for each case; Available at VMware Health, Profiles, Security Hardening Checks An established framework of tools, developed by VMware is available to perform scheduled overhauls for compliance checks; Technology Risk Compliance Checks
20 Response Plan for Security Incidents Can be Scripted With vSphere network virtualisation and extensive API set, framework of response scripts that would separate environment into isolated zones / shut down breached services / isolate users could be in possession of Global Information Security team as the “Red Button” Near real-time audit of changes and ability to extract unusual entitlements (i.e. matching roles and users back to objects / VMware resource pools) could help prevent internal sabotage or detect it a-posteriori. Complete integration with Active Directory or LDAP Services will decrease risk of permission delegation that could be outside any central regular governance and audit; Knowledge of virtualisation is required within Security team Response to Incident
21 Full Logging is available Out of the Box vCenter database is the single resilient archive for all events and tasks submitted by users; vSphere comes with an out-of –the-box syslog collector for full low-level logging that would enable a- posteriori compliance; vShield logging allows audit of all changes introduced to virtualised networks; Centralisation of many workloads into one logs rollup enables convenient logs scanning by IPS / IDS; Root Cause Analysis is not covered under the normal Production Support It’s under Premier Support (business critical or mission critical) Source: On-site Support is only available in Mission Critical Support Root cause analysis
22 Application: Clauses Separate environment for Production UAT System Test Integration Test Development Some may even need a Production Fix environment. Separate environment for Production UAT System Test Integration Test Development Some may even need a Production Fix environment. Common apps do have Reference Design on vSphere Roll back and snaphot is required. Complete cloning helps compliance.
23 Application: analysis From clause and 7.2.2, virtualisation is allowed Logical separation is allowed. The “or” here means it is acceptable to use shared pool of compute and storage with total virtualisation separation through the use of VLANs, VXLANs, Storage Pools, Resource Pools and virtual firewalls. Virtualisation makes environment cheap and more manageable. In the physical world, it is common practice for FI to mix non production, as it does not make sense (business sense, common sense) to buy 1 server for each environment. Proper testing becomes complex as the various phases (design, development, test) often use the same copy. Cloning a physical server is a long process. Restoring is also difficult with back up is the most common way. But restoring entire OS is a long process. Virtualisation makes snapshot and rollback easier. It can even be done on live production VM running database. With virtualisation, environment becomes cheap. 1 Production VM can easily have 10 Non Production VM. It is easy to clone, snapshot, destroy and replicate. Unit Test, Integration Test, System Test, UAT Test, DR Test, etc can now be created. It is even possible to create the entire environment per user. Virtualisation allows live cloning, while the VMs are running. With virtual networking such as vShield Edge, the non-production clone can be placed inside a private, bubble network. This helps identifying problem quickly in production, as they are identical copy. VMware Tools uses VSS to quiese Windows file systems. Apps such as MS SQL and Oracle also use VSS. With VAAI enabled, the cloning is offloaded to physical storage. This drastically reduced the load on ESXi host, storage fabric, storage front end ports. It also speeds up the cloning. Addressing the security patch test issue The only way to test is to patch it, then run application or regression testing. This is time consuming and risky. Virtual Patch enables IT to buy time. Production is patched virtually. Non Production is patched “physically”. vSphere, via its API, enables virtual patch. Partners like TrendMicro has leveraged this API to patch/close vulnerability without tampering with the guest OS. This enables FI to patch production while testing in non production. Common applications have reference architecture on VMware Oracle, Exchange, SAP, SQL Server, SharePoint, unified messaging, etc.
24 Inventory & Configuration Management This is drastically simplified in “software- defined” architecture Continuous Compliance with automatic remediation is possible.
25 Inventory & Configuration Management Inventory management is drastically simplified in virtual world. In the physical world, the management tool and the items being managed are 2 separate, independent entity. They can be out of sync, which is why a stock take is needed from time to time. In the virtual world, the “items” are defined in software. It can’t exist unless it is defined. For server, a VM cannot exist unless it’s defined in vSphere. Hence inventory has become embedded. For network, instead of many physical switches, you will have just a few distributed virtual switch spanning the entire datacenter. Physical server (the ESXi host) inventory becomes less relevant as it has no data. In fact, the entire hypervisor can also be streamed (not installed), making the entire box is just a physical box with no identity when powered off. Host Profiles provides a mechanism to ensure each ESXi match the master image vCloud Suite provides a built-in inventory of all Servers, Network and Storage vSphere maintains the inventory. VCM tracks the changes in this inventory. SRM maintains the inventory + with its associated DR scripts. VIN maintains the map of application dependancy among VM. This list is integrated to SRM to enable a services-based DR. vShield Manager defines all the firewall rules. vCNS Service Insertion Framework ties functionalities that are otherwise not integrated. While this does not directly answer the inventory management, the inventory is not completely independent as a result. Plug-in from hardware vendors provide single-pane of glass from vSphere UI. vCenter Operations provide a continuous compliance via its Compliance Badge and VCM module.
26 Technology Refresh Plan vCloud Suite normally follow a yearly update, typically after VMworld. This is more frequent than traditional softwares, which goes for 2-4 year upgrade cycle. Hypervisor is a low level software, directly interfacing with hardware. Upgrading hypervisor is best done with new hardware purchase. This turns Upgrade to Migrate, significantly reduce the risk. A virtual DC needs to have a formal refresh strategy Unlike physical DC, a virtual DC changes more frequently. For a large farm (>1000s VMs per datacenter), how many vCloud Suite versions do you allow A good practice is to decouple the hypervisor and the management component. ESXi should get upgraded together with hardware. So it tends to have 3-4 year upgrade cycle. vCenter, and its associated plugins/extension, should get upgraded yearly. This is because it cannot manage a newer ESXi
27 vSphere life cycle New hardware support window: 1.5 years. Non-critical bug fixes: 2 years (from a Major release, not Minor)
28 Change Management
29 Change Management Change Management needs to recognise that virtual world has different changes Changes happen much faster and more frequent in the virtual world. This is because the datacenter itself is defined in software. This is good, as that means IT can support business better. Business normally see IT has a barrier and road block, slowing business instead of powering it. Making it rigid with a blindly applied CM can defeat the purpose of virtualisation. ESXi host change is not the same as physical server change as there is no downtime to the business. Storage upgrade is very different if the storage is virtualised (read: distributed) The storage “array” is now distributed to >1 box, in many cases it extends beyond 8 boxes. vMotion, DRS, Storage DRS should not be considered a change request, This needs to be discussed in greater depth. Do we allow a 1 TB storage vMotion as it will have impact on the Storage Network and Storage IOPS, not to mention performance to the up-stream application. Virtualisation can reduce the risk as the change itself can be virtual. E.g. patching Windows can be done safely if it’s done virtually. This is because it does not change the Guest OS at all. Virtual patch is technically closing the vulnerability via the hypervisor as the hypervisor sees everything.
30 Capacity Management Capacity management changes in the virtual world This is due to the sharing nature and dynamic movement As datacenter services move toward software, they are taking up resources like any other workload. This new type of workload needs to be accounted for in capacity management. Example of datacenter services that run on every ESXi host Security Services Availability Services Management softwares should be separated into a cluster by itself Hence it is not part of the Capacity available to business workload
31 Capacity Management at Infra-Level: IaaS “workload” IaaSComputeNetworkStorage vMotion, Storage vMotionLowHigh VM Snapshot, cloning, hibernateMediumLowHigh Virtual Storage (e.g. VMware VSA, Distributed Storage, Nutanix) MediumHigh Backup (e.g. VDP Advance, VADP) with dedupe -Back up vendor can provide the info -For VDP-Advance, you can see the actual in vSphere or vCenter Operations High Hypervisor-based AV, IDS, IPS, FWMedium Low vSphere ReplicationLowMedium Edge of Network services (e.g. vShield Edge, F5)LowHighLow DR Test or Actual (with SRM)High HA and cluster maintenanceHighN/A Fault Tolerant (VM)HighMediumLow
32 Proper Certification Virtualisation require deep expertise The inter-connected nature of virtualised datacenter means the skills have to be both deep and broad. Most of the advanced certifications from VMware cover much broader, non-virtualisation related aspects of design, imlementation, decision impact analysis, application interdependencies. Considering having advanced certification (e.g. VMware VCAP-DCD, VCAP-DCA) in your team Once you are >75% virtualised, it is safe to say that the virtual platform is the largest platform in the datacenter. It is a mission critical platform. >95% of support cases handled by Mission Critical Support team in VMware Global Support is not about VMware. It is about storage, network, hardware, drivers, etc.
33 End User Computing
34 End User Computing Virtualisation enables control It’s hard to lock a physical laptop. Much easier to control a VM living on the physical laptop. With 4 GB RAM and SSD storage, running a VM on top of the base OS no longer impacts productivity. This also address BYOD. Employee can bring their Mac or Sony or Ubuntu, and running OS that is not supported by IT. IT only supports the Corporate VM, which can be controlled centrally. The separation of personal and work environment can in fact improve productivity as users want to have complete freedom when they are on their non- work environment. They do not want that their surfing Internet on their private time is monitored by corporate firewall. For desktop, it’s easier to control a VDI. The Windows images will not deviate from the standard build as there is only 1 image actually deployed. The Wireless LAN should not be part of corporate intranet. Typical solution is to deploy SSL VPN. This connects the remote device to the corporate network, hence exposing the corporate network. VMware Horizon Suite helps by not using SSL or VPN, and not linking the end device to the corporate network. Access from device will be limited to View Security Server, which only speaks PCoIP and HTTPS protocol. PCoIP does not carry data. It purely sends pixels. And it’s encrypted. All devices have View client or just HTML 5 capable browser All other applications in the end device continue to be separated from corporate network. Only the View client is connected, and it only receive pixel. Every Internet services is considered insecure. DropBox, LinkedIn, FaceBook, and other on-line collaboration improves employees productivity, and they are becoming necessary as their own customers demand that. VMware Horizon Suite provides an IT-controlled dropbox-like file sharing.
36 TRM framework needs to be virtualisation-aware The TRM framework should be aware of virtualisation “System security, reliability, resiliency and recoverability” changed in the virtual world. E.g. the typical HA cluster solution the physical world does not apply in the virtual world. The typical DR solution does not apply too. Roles and Responsibility changes are required when operating a virtual datacenter, as Infrastructure is transformed from “system builder” to “service provider”. Virtualisation Center of Excellence has to be established and their scope include the TRM framework.
37 Scope for PCI Audits Now Includes Virtual Infrastructure Virtual machines Virtual switches/routers Virtual appliances Virtual applications/desktops Hypervisors The PCI DSS security requirements apply to all system components. In the context of PCI DSS, “system components” are defined as any network component, server, or application that is included in or connected to the cardholder data environment. “System components” also include any virtualization components such as virtual machines, virtual switches/routers, virtual appliances, virtual applications/desktops, and hypervisors. “ PCI DSS v2.0 p10
38 Inflexible Longer lead times to deploy and scale Not extensible Complex Physical device sprawl Manual provisioning Fragmented management Costly High CAPEX and OPEX Under utilization of compute Challenges A sample Reference Architecture for Retail PCI
39 Internet / MPLS / Etc Non-PCI PCI Non-PCI PCI Store Zone 1 vSphere PCI Zone vSphere Store Zone 2 vSphere DMZ Non-PCI Zone PCI Zone WEB/DMZAPP DB Layered virtual firewalls provides segmentation and isolation vSphere Distributed Firewall Benefits Avoid perimeter chokepoint using distributed firewall Eliminate traffic U-turns (trombone), which impact latency. Simplify network design using a flat network