Cloud Dependability: How Close Are We? Challenges and Research Issues Towards Dependable Clouds Marc Lacoste, Thierry Coupaye Orange Labs International.

Cloud Dependability: How Close Are We? Challenges and Research Issues Towards Dependable Clouds Marc Lacoste, Thierry Coupaye Orange Labs International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS’11) Grenoble, France - October 12th, 2011 © Orange Labs, Research & Development 2010 Large Magellanic Cloud (Distance: 160, 000 light -years) Source: NASA ? Safe & Secure Clouds

2 The Dark Cloud: Recent Threats on Cloud Dependability  Availability: –Major outage on Amazon EC2 storage (2011). –DDoS attack on AWS brings Bitbucket services to a halt (2009).  Inter-VM attacks: –Hey You! Get off My Cloud! on Amazon VMs [Ristenpart et al., CCS’09].  Hypervisor subversion: –Virtunoid: KVM isolation breakout [Elhage, DEFCON’11]. –CloudBurst: VMware guest VM escape [Kortchinsky et al. BLACKHAT’09]. –Bluepill: rogue hypervisor beneath VMs [Rutkowska et al., BLACKHAT’06]. –SubVirt: VM-based rootkit [King et al., Security&Privacy’06].  Crimeware-as-a-Service: –EC2 cloud used against Sony’s PlayStation Network (2011). –Rootkits (SpyEye: 2011), botnets (ZeuS: 2009) in clouds.  Network security: –Critical vulnerability in Eucalyptus open source cloud (2011). Where are we on cloud dependability?

3 Agenda  A Short Walk through the Clouds  Dependability Barriers  An Autonomic Perspective on Solutions –Self-Healing Clouds –Self-Protecting Clouds  Open Issues and Perspectives © Orange Labs, Research & Development 2010

4 A Short Walk through the Clouds

5 Definition of Cloud Computing*  Technological vision: Cloud computing is a model for enabling on-demand network access to a shared pool of virtualized computing resources (networks, servers, storage, applications, devices/mobiles and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction (self-service model through API or web portals)  Market vision (XaaS): same + pay-per-use (or pay-as-you-go) billing models 5 characteristics 1. On Demand Self-Service 2. Broad Network Access 3. Virtualized Resource Pooling 4. Rapid Elasticity 5. Measured Service 3 delivery models (markets) 1. Cloud Software as a Service (SaaS) 2. Cloud Platform as a Service (PaaS) 3. Cloud Infrastructure as a Service (IaaS) 4 deployment models 1. Private cloud 2. Public cloud 3. Hybrid cloud 4. Community Cloud 3 pivotal technologies 1. Virtualization 2. Autonomics (automation) 3. Grid Computing (job scheduling) * Adapted from NIST definition

6 Technologies, Visions, Usages, Markets Server Virtualization Storage Virtualization Network Virtualization Desktop / Laptop Virtualization Device / Embedded Virtualization Autonomics Grid Computing P2P Cloud Computing ~ On-demand Computing Utility Computing Elastic Computing Scalable Computing Future Internet Internet of Services Green IT Enterprise Cloud Mass market Clouds Technologies Data Center Consolidation Public / Private / Virtual Private / Hybrid Clouds Compute / Storage Clouds Data Center Resource (servers, routers) ~ Computer Center Hosting Facility, Server Farm, Data Farm Desktops / Laptops Other Devices (mobile, network equipments) Visions Usages Markets Physical Resources SaaS (Service, Apps) PaaS (Platform) IaaS (Infrastructure) RaaS (Resource) FaaS (Facility)

Drivers for hosters  Consolidation through virtualization and automation allows for: –Ease and speed up (un)provisioning drastically –Maintain far larger IT infrastructure –Reduction of risk of human errors –Possibility of better energy management –… altogether tailor IT infrastructure to the need and optimized OPEX/CAPEX © Orange Labs, Research & Development 2010 security/ compliance faster deployment of new applications cost reductions / budget constraints IT management IT optimization rapid adaptation to real needs

8 Dependability Barriers

9 Barriers  Lack of technical maturity –SLA, auto-scaling/auto-sharing, performance, availability, dependability –PaaS: applications packaging, deployment, management, test, configuration management… –Storage –Network –Security, privacy  Major risks of lock-in –Lack of standards (API, programming models) –Lack interoperability –Lack of portability (applications, mgt tools)  Legal Issues –Software licences –Data location (eg government, health)  Integration with legacy IT (IS)  Huge investments in data centers - building, hardware, cooling, energy

10 The cloud raises many security issues

11 Risks are multi-faceted, both for the customer and the provider

12 Cloud computing security: 10 major roadblocks ahead End-point Security Ensure security of virtualized computing infrastructures Important to critical End-point Security Ensure security of virtualized computing infrastructures Important to critical Data Protection Ensure data security/privacy in a shared context Important to critical Data Protection Ensure data security/privacy in a shared context Important to critical Network Security Ensure security of virtualized network infrastructures Incremental Network Security Ensure security of virtualized network infrastructures Incremental Trust Enablers Prove to users that clouds are trustworthy Critical Trust Enablers Prove to users that clouds are trustworthy Critical Identity management Privacy Data traceability Legal & regulatory issues Transparency & compliance Openness End-to-end securityNetwork isolation Hypervisor security Elastic security CriticalImportantIncremental CriticalImportantIncremental CriticalImportantIncremental CriticalImportantIncremental Critical: The roadblock is essential to lift for adoption. Almost no known solutions. Important: Lifting the roadblock is a major step forward. Some first solutions. Incremental: The roadblock may be overcome by small enhancement of already existing technologies.

13 End-point security Guarantee security when computing resources are virtualized. VMs have also their own vulnerabilities. Those threats may be mitigated by:  Hardened images.  Strict VM security life-cycle management.  Apply security-by-default configurations. Virtualization brings many threats.  Safe in theory.  But… Hyperjacking, misconfigurations, malicious device drivers, backdoors between VM and hardware… Barrier #1: Hypervisor security

14 Network security Guarantee security when network resources are virtualized. ? ? ? ? Isolation is no longer physical but logical.  Isolation is less precise.  Security guarantees are weaker. Challenge: map existing network security components to new cloud architectures. Are traditional security architectures still effective in a virtual environment?  Risks similar to known networks.  Known counter-measures are thus still applicable. VPNs, VLANs, firewalls, IDS/IPS, encryption, signature… Barrier #2: Network isolation

15 Network security Guarantee security when network resources are virtualized. Automated security management is still lacking.  Research: autonomic, self-protecting security architectures.  Operational: early warning systems, integrated proactive security architectures. Flexible security provisioning to match fast evolving risks.  First solutions: Flexible management of VPNs. Overlays in full network virtualization. SIEMs. Barrier #3: Elastic security

16 Data protection Guarantee data security in a shared multi-tenant environment. Lack of end-to-end identity management (in and across data-centers).  Barriers: scalability, heterogeneity, interoperability.  Authentication: should be overcome.  Authorization: in its infancy.  A Security-as-a-Service opportunity. Barrier #4: Identity Strong isolation throughout the lifecycle of personal information.  Many tough questions: Need-to-know enforcement, secure data storage, data retention and destruction, legal implications…  Today’s PETs are not enough! Barrier #5: Privacy

17 Data protection Guarantee data security in a shared multi-tenant environment. Barrier #6: Traceability Being unable to locate the data adds to the loss of control over IT resources.  Legal, political, trust issues: Compliance, data hosted abroad exposed to foreign governments. Inability to prove that data comes from a trusted source.  A widely unchartered area. Barrier #7: Legal issues Multiple conflicting jurisictions over the cloud data flows.  Cloud providers: have trouble providing assurance of compliance with regulations.  Customers: have trouble understanding the rights and obligations of each party.  Importance of security SLAs.

18 Trust enablers Prove to third parties that the cloud infrastructure is trustworthy. Source: Gartner. Analyzing the Risk Dimensions of Cloud and Saas Computing, 2010. Barrier #8: Transparency Prove security hygiene of provider infrastructure to third parties.  Auditability, certification process, risk analysis methodologies, compliance.  Trusted cloud computing technologies provide cryptographic evidence.  Clear-cut SLAs to clarify responsibilities. Barrier #9: Openness Avoid lock-in, interoperate with other cloud infrastructures.  Issues: API portability across providers, interoperability, scalability.  Enabler of the « inter-cloud » vision (« sky computing »).  Flexibility and security benefits of open source cloud architectures.

19 Source: Cloud Security Alliance. Security Guidance for Critical Areas of Focus in Cloud Computing, 2009. Trust enablers Prove to third parties that the cloud infrastructure is trustworthy. Barrier #10: End-to-end security Seamless orchestration of multiple fragmented security building blocks.  Definition of a cloud reference security architecture providing a consistent overall view of cloud security.  Importance of standardization.

20 Cloud Security Today Security = #1 barrier for cloud computing adoption  To develop trust of cloud customers, cloud security should be… StrongFlexibleEfficientSimple  Multiple dangers  No inside / outside frontier  Fast evolving threats  Multiple security goals  Integrated security management  Easy administration  Good performance  Cost-effective security

21 StrongFlexibleEfficientSimple Cloud Security Today  … but unfortunately today cloud security is: Not Strong Not Flexible Not Efficient Not Simple  Hard-to-detect vulnerabilities  Hard-to-place and synchronize countermeasures  Static configurations, low reactivity  No generic security supervision architecture  Complex, heterogeneous security mechanisms  Manual administration nightmare  Costly, error-prone management of security Traditional approaches to protection are not enough!

22 Towards Solutions ? Self-Healing, Self-Protecting Clouds

23 Autonomic Cloud Dependability Management An automated vision of cloud dependability: self-healing, self-defending clouds  Approach (for security):  Benefits: –Simple, strong, flexible, autonomous cloud safety for its customers. –Lighter administration, increased reactivity and agility, lower operating costs. –Graduated response to threats / to faults. –Enabler for integrated supervision.

24 Resilience and Fault-Tolerance in Cloud Computing  What we do need for cloud resilience is –Failure statistics from operational cloud environments –Fault models –Fault tolerance algorithms and mechanisms – replication, backup, disaster recovery, self-repair…  … that deal with cloud specificities –Open, dynamic, virtualized environments : VMs can move between physical servers, so does latency and hardware reliability –Scalability and multi-tenancy: thousands of applications (each generally made of multiple VMs) hosted on a single platform, resources actually shared by multiple applications/users –Layered infrastructures (SaaS/PaaS/IaaS) with multiple administrative roles/domains with limited control offered to application developpers –Pay-per-use: you have to pay for ressources you use (VMs)!

25 Where are we on cloud resilience ?  A huge background on distributed systems dependability, including in grid and autonomic computing (self-repair), but few works specifically targeting cloud reliability –e.g. 0 paper in IEEE Cloud 2009, 1 paper in IEEE Cloud 2010, 1 paper in IEEE Cloud 2011  Some illustrative work-in-progress –Server failures characterization with objective of characterizing the complete data center hardware (server, storage, network) reliability model [Microsoft Research, ACM Symposium on Cloud Computing’2010] –3-replicas ring with scatter placement algorithm of backups of cloud management nodes [U. Maryland, IBM Research, IEEE IPDPS’2009] –Crash and timing fault tolerance (not Byzantine faults) through strong replica consistency of applicative nodes (semi-active and semi-passive replication) [U. Cleveland, UC Santa Barbara, IEEE Cloud’2010] –Bizantine fault tolerance through 3f+1 active replication in community (‘’volontary’’) cloud [U. Hong Kong, NUDT, IEEE Cloud’2011] –Adaptive fault tolerance (forward and backward recovery) in real time cloud through runtime continuous reliability assessment of nodes and replication with variants (VMs) [INRIA, IEEE World Congres on Services’2011]

26 Virtual Appliance Management Platform (VAMP)  VAMP is a PaaS Application Lifecycle Platform (ALM) under development inside Orange Labs.  VAMP targets the construction (e.g. VM image generation), deployment and management of distributed applications in the cloud.  VAMP is based on a architectural description (component-based with the Fractal component model) of applications that is used throughout the complete lifecycle of applications (including ‘model@runtime’) Control Plane (VAMP) Applicative Plane (legacy application) Reification and control Components

27 VAMP Runtime Architecture deployment manager configurator agent VM0 VM1 VM2 Control Plane (VAMP) Applicative Plane (legacy application) Message Oriented Middleware (MOM) VAMP manager VAMP manager creates and repair Deployment Managers. VAMP manager may be inside ou outside the IaaS Deployment Manager (1/application) create applicative VMs and recreates VMs when they fail Configurator Agents self- configure the application

28 Reliable Self-configuration in VAMP (work in progress)  Self-configuration protocol –At VM creation, each component C (managed by a Configurator inside a VM) : – Announces itself to the configurator network; – Configures locally the applicative elements; – Exports its server interfaces to its “client components” which can then bind to C (‘bind’ signals); – Once all C client interfaces are bound, C starts and notifies its start to its “client components” (‘start’ signals). –Gradually (“epidemically”), the complete application is started. –When a VM failure is detected, the faulty VM is replaced by a new VM which is simply introduced in the self- configuration protocol as if it was in its initial deployment phase (except that ‘start’ signals are interpreted as ‘restart’ by already running components).

29 Synthesis on VAMP  A decentralized approach based on asynchronous and reliable communications. –Each VM evolves in parallel at its own pace. –No need for global synchronization between VMs.  Replication (primary-backup schema, passive replication) of DMs complete the self-configuration protocol for complete application reliability.  The approach works for VM crash, transient network failures, and for stateless components.  The self-configuration protocols is also a (basic) self-repair protocol: a running application is seen as in a continuous deployment process.

30 Attack targets and countermeasures Hypervisor vSwitch Hardware VM STORAGE VNIC CPUMEMORYNETWORK PNIC MAC spoofing/snooping:  Static address allocation. IP attacks:  Virtual firewalling. VLAN hopping:  Physical traffic segregation. Hyperspacing:  MMU, IOMMU. Hyperthreading:  No hyperthreading. Buffer overflows:  No eXecute bit. Hyperjacking:  High attack surface:  Certification, open, modular solutions.  Hypervisor integrity:  TPM Secrecy violations:  Authentication, signature. Integrity violations:  Encryption of stored data (self-encrypting drives). DoS (resource starvation):  Memory overcommit, optimized page sharing, balloon drivers…  Quotas, priorities.

31 Detection: VM Introspection BenefitsIssues Monitor VM behavior, detect unusual behaviors.Little remediation actions (simple actions, e.g., restart, kill VM). Rely on VMM to measure VM integrity. What happens if VMM is compromised? In context measurement is difficult (real- time monitoning) Semantic gap problem. May require to add hooks in the VM.Stealth is difficult to achieve. In-VM monitoring: Proximity to monitored target. Possibility of corruption by malware. How to protect monitoring component? In hypervisor: Users cannot kill monitoring component. May require to modify VMM. In management VM: Performance improvement (offload philosophy) Less reactive? 2. monitoring agent Protection target: VM Systems With hooks in VM: Lares, XenAccess, KVMSec With no hooks in VM: CloudSec In VM monitoring code: SIM Offline: LiveWire Miscellaneous: Re-virt (log), Ether, HIMA 1. hook Monitored VM Management VM 1. Monitoring agent Hypervisor 2. Monitoring agent 3. Monitoring agent 2/3. hook ? ?

32 Detection: Trusted Computing BenefitsIssues High assurance, strong isolation.Protection of hypervisor remains undefined, e.g., malicious host OS drivers. Flexibility: allows to support different security requirements / policies. Problem of a sotware-only approach: e.g., how to define a dynamic root of trust. May be lifted if TPM is present. Efficient.Compatibility with legacy systems? Attestation capabilities.Dynamic monitoring is difficult: in-context measurement. 2. monitoring agent Protection target: VM Systems Trusted VMM: Terra + TPM In management VM: vTPM Certification: seL4 1. hook Monitored VM e.g., for integrity Management VM Hypervisor 2. Monitoring agent 1. Monitoring agent Host OS drivers ??

33 Reaction: Sandboxing BenefitsIssues Good performance (by placing performance-critical code in kernel or using hardware virtualization). Strong security against malicious driver behavior in hypervisor (avoid DMA attacks). Difficult to protect RM against attacks if malicious code in hypervisor (if no hardware virtualization is used). Security policy in the RM is difficult to configure by hand (benefits of the autonomic approach). Flexible isolation policies. Flexible frontier between u-driver and k-driver (Micro-Drivers). Reduced code size: executing part of the drivers in user-mode may reduce amount of kernel code. No remediation. Limited to fault / attack containment. Requires modification of kernel. 2. monitoring agent Systems RM between driver and kernel: Software Fault Isolation (SFI) techniques RM between driver and user space: MicroDrivers Proxos RM between driver and device: Nooks 1. hook VMs Management VM 1. Reference Monitor (RM) Hypervisor / Host OS (Kernel) Driver Protection target: Hypervisor VM Host OS (Userland) Device 2. RM 2. Driver 3. RM ? ? ?

34 Reaction: Virtualization BenefitsIssues Strong security and isolation (separation from protection target, i.e., one layer below). Stealthy (ring -1 technique, for VM protection). How to protect the hypervisor layer? Nested virtualization but enormous performance penalty. Hypervisor protection mechanisms remains very limited (e.g., memory hiding in VASP/HBSP). No driver code modification. Compatibility with legacy code. Flexibility: open hypervisor APIs to customize memory management, execution control. Performance overhead may be reduced by hardware virtualization. Heavy performance penalty. 2. monitoring agent Protection target: VM / Hypervisor Systems VM protection (guest OS I/O mediation): BitVisor VM protection (guest OS isolation): HBSP/VASP VM/hypervisor protection (driver virtualization): iKernel (full), LeVasseur et al. (partial). Hypervisor protection (nested virtualization): Turtles 1. hook Management VM Hypervisor / Host OS VM Driver Stub VM Hyper-Hypervisor Driver Stub VM

35 Virtual Security Appliances  Design issues: –Which network security architecture? (physical, virtual, hybrid…) –Which software layer? (vSwitch, hypervisor, VM, multi-layer…) Architecture #1: Physical Architecture #2: Virtual Architecture #3: Hybrid

36 Autonomic Management of IaaS Infrastructure  Results: –First framework for building self-protecting IaaS infrastructure. –Prototype running on KVM hypervisor – extensible to others (VMware). Flexible confinement of VMs according to risk level

37  Autonomic security management of IaaS infrastructure.  Orchestration patterns to compose autonomic loops: –Between views. –Between layers.  Pattern types : centralized, hierarchical, P2P...  Difficult (impossible?) administration by hand.  2 autonomic loops: network-level and device-level. Vertical orchestration patterns Horizontal orchestration patterns Extension to Multiple Loops Self-protection loops

38  Interference between loops: –Physical layer: node isolation in a VLAN (network quarantine). –Security appliance: update attack signature database.  Framework overview:  A hierarchical + layered cloud security self-management model.  Difficult (impossible?) administration by hand.  2 autonomic loops: network-level and device-level. Orchestration of Multiple Loops

39 Open Issues and Perspectives

40 Open Issues: Detection  Place of detection mechanisms: which monitoring granularity and scope? –In which layer? hypervisor (VM introspection), VM (self-defending VMs), hybrid. –Where in the cloud architecture? host-based, network-based, hybrid. –Coordination of detection mechanisms. –Should the monitoring system be co-located with the protection target?  Semantic gap: –Correlation of monitored information across layers. –Possible approach: ontologies and Semantic Web.  Monitoring intrusiveness: –Performance impact? –Stealth of detection mechanisms?  Forensics: real-time detection vs. post-mortem analysis?  Malware detection: a new field in the cloud not yet understood.

41 Open Issues: Reaction  Place of reaction mechanisms (as for detection).  Adaptable reaction mechanisms: –Defense configuration remains static / complex, e.g., for network security. –Emergence of cloud networking. Issues: real-time virtual network reconfiguration; on- demand chaining of security services; security of network management interface.  Influence of VM migration: long distance migration (across data centers, across clouds). Migration of the security state consistently, securely, and efficiently (e.g., IP address space)?  Remediation mechanisms: Currently limited to threat containment, with no elimination.  Hypervisor protection: still little addressed. –Isolation of device drivers: rich literature in the system community (virtualization, sandboxing, language techniques, new kernel architectures...) is applicable. –Assurance: control flow integrity, attestation, compliance proofs, trusted computing.

42 Open Issues: Decision  Multi-lateral security policy management: –Policy definition: despite advances, current policies are not sufficiently flexible for the cloud and the inter-cloud setting. Need for policies spanning organization boundaries, geographical borders, applicable in multiple contexts, with multiple actors. –Automated policy aggregation and deployment.  Security adaptation strategy: –Flexible policy negotiation: Some first frameworks (e.g., XENA), but a lot still to be done. Issues: interoperability, multiple conflicting stakeholder responsibilities, multiple jurisdictions. Promising solution: ontologies. –Strategy representation: notion of policy continuum, DSLs.  Coordination of multiple self-protection loops: Stability of the result? –Event correlation from monitoring components? Decision synchronization towards between reaction components? –Authentication of communications: detection / decision; decision / reaction. –Lack of security supervision architecture: for the cloud, and the inter-cloud setting.  Learning from past attacks to improve security and build defenses against future threats.

43 Open Issues: Resilience  Main roadblocks toward cloud resilience: –Cloud layered architecture with different responsabilities: coordination and consistency of mechanisms offered by platforms versus mechanisms programmed by applications? –Reliability of cloud management services themselves. –Applicative models (stateless/statefull, strong/weak coupling) and fault-tolerance?  Self-stabilization and cloud resilience? –Applicativity of self-stabilization in cloud? – At the hypervisor level? At VM and applicative level? – Openness of cloud systems? –Cost (w.r.t. performance) of self-stabilization in cloud?

44 Conclusion  In the cloud, security is clearly not an option: –Migrating critical data and applications to the cloud is and remains risky. –Check (and stick to) best practices!  The main challenge of cloud dependability remains trust: –A lot of building blocks already there, but still a long way to go. –Importance of SLAs to clarify responsibilities between customer and provider. –Self-healing, self-protecting cloud architectures can help, both in specification and enforcement of such SLAs. –Strong security, reliability, transparency, and accountability will be the key enablers to maintain solid and durable trust between a CSP and its customers.

Cloud Dependability: How Close Are We? Challenges and Research Issues Towards Dependable Clouds Marc Lacoste, Thierry Coupaye Orange Labs International.

Similar presentations

Presentation on theme: "Cloud Dependability: How Close Are We? Challenges and Research Issues Towards Dependable Clouds Marc Lacoste, Thierry Coupaye Orange Labs International."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Cloud Dependability: How Close Are We? Challenges and Research Issues Towards Dependable Clouds Marc Lacoste, Thierry Coupaye Orange Labs International.

Similar presentations

Presentation on theme: "Cloud Dependability: How Close Are We? Challenges and Research Issues Towards Dependable Clouds Marc Lacoste, Thierry Coupaye Orange Labs International."— Presentation transcript:

Similar presentations

About project

Feedback