Real World Cloud Application Security

Real World Cloud Application Security chan@netflix.com

About Me
Director of Engineering @ Netflix
Responsible for:
– Cloud app, product, infrastructure, ops security
Previously:
– Led security team @ VMware
– Earlier, primarily security consulting at @stake, iSEC Partners

Netflix, Inc. “Netflix is the world’s leading Internet television network with more than 33 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series...” Source: http://ir.netflix.com

APPSEC CHALLENGES

Lots of Good Advice
BSIMM
Microsoft SDL
SAFECode

But, what works? Forrester Consulting, 12/10

Especially, given phenomena such as DevOps, cloud, agile, and the unique characteristics of an organization?

CLOUD @ NETFLIX

Availability

“Undifferentiated Heavy Lifting”

Netflix Culture “may well be the most important document ever to come out of the Valley.” Sheryl Sandberg, Facebook COO

Scale and Usage Curve

Netflix is now ~99% in the cloud

On the way to the cloud... (architecture)

On the way to the cloud... (organization) (or NoOps, depending on definitions)

DEPLOYING CODE

A common graph @ Netflix
Lots of watching in prime time
Not as much in early morning
Old way: pay and provision for peak, 24/7/365
Multiply this pattern across the dozens of apps that comprise the Netflix streaming service
Weekend afternoon ramp-up

Solution: Load-Based Autoscaling

Autoscaling
Goals:
– # of systems matches load requirements
– Load per server is constant
– Happens without intervention (the ‘auto’ in autoscaling)
Results:
– Clusters continuously add & remove nodes
– New nodes must mirror existing
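
The first two goals (system count matches load, load per server stays constant) come down to a small sizing calculation. The sketch below is illustrative only: the function name, the requests-per-second unit, and the min/max bounds are assumptions and not part of the Netflix or AWS tooling.

```python
import math


def desired_instances(total_load: float, target_per_server: float,
                      min_size: int = 1, max_size: int = 100) -> int:
    """Return the cluster size that keeps load per server near the target."""
    if target_per_server <= 0:
        raise ValueError("target_per_server must be positive")
    # Round up so that no server is pushed above its target load.
    needed = math.ceil(total_load / target_per_server)
    # Clamp to the group's configured bounds (hypothetical defaults).
    return max(min_size, min(max_size, needed))


# Prime time versus early morning, in hypothetical requests per second.
for load in (9500, 1200):
    print(load, desired_instances(load, target_per_server=100))
```

Run on a schedule against a live load metric, a calculation like this is what lets the cluster grow and shrink without intervention.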

Every change requires a new cluster push (not an incremental change to existing systems)
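
One way to picture a "new cluster push" is through versioned cluster names, such as the testapp-live-v001 and testapp-live-v002 names in the Exploit Monkey email later in this deck. The helper below is a hypothetical sketch of that naming step. Starting at v000 and wrapping after v999 are assumptions, not documented behavior.

```python
import re

# Matches names such as "testapp-live-v001": a base name plus a 3-digit version.
_VERSIONED = re.compile(r"^(?P<base>.+)-v(?P<num>\d{3})$")


def next_cluster_name(current: str) -> str:
    """Return the name for the cluster that replaces `current`."""
    match = _VERSIONED.match(current)
    if match is None:
        # Unversioned name: start a sequence (assumed convention).
        return f"{current}-v000"
    number = (int(match.group("num")) + 1) % 1000  # assumed wrap-around
    return f"{match.group('base')}-v{number:03d}"


print(next_cluster_name("testapp-live-v001"))
```

Because the old cluster is left untouched until the new one takes traffic, "rollback" amounts to pointing traffic back at the previous name.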

Deploying code must be easy (it is)

Netflix Deployment Pipeline
Perforce/Git (code change, config change)
-> YUM (RPM with app-specific bits)
-> Bakery/Aminator (base image + RPM)
-> AMI (VM template ready to launch)
-> ASG (cluster config)
-> Running systems

Operational Impact
No changes to running systems
No systems mgmt infrastructure (Puppet, Chef, etc.)
Fewer logins to prod
No snowflakes
Trivial “rollback”

Security Impact
Need to think differently on:
– Vulnerability management
– Patch management
– User activity monitoring
– File integrity monitoring
– Forensic investigations

Architecture, organization, deployment are all different. What about security?

We’ve adapted too. Some principles we’ve found useful.

POINTS OF EMPHASIS

Points of Emphasis
– Integrate
– Make the right way easy
– Self-service, with exceptions
– Trust, but verify
Two contexts:
1. Integration with your engineering ecosystem
2. Integration of your security controls
Organization
SCM, build and release
Monitoring and alerting

Integration: Base AMI Testing
Base AMI
– VM/instance template used for all cloud systems
– Average instance age = ~24 days (one-time sample)
The base AMI is managed like other packages, via P4, Jenkins, etc.
We watch the SCM directory & kick off testing when it changes
Launch an instance of the AMI, perform vuln scan and other checks
SCAN COMPLETED ALERT
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities: 4
Moderate Vulnerabilities: 4
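
A scan alert like the one above can be turned into an automated pass/fail signal for a candidate base AMI. The sketch below parses the three severity counts from the alert text and applies a threshold. The function names and the zero-tolerance defaults are assumptions for illustration, and the scanner's real output format may differ from this slide excerpt.

```python
import re

# Pulls "<Severity> Vulnerabilities: <count>" pairs out of the alert text.
_COUNT = re.compile(r"(Critical|Severe|Moderate) Vulnerabilities:\s*(\d+)")


def parse_scan_alert(text: str) -> dict:
    """Return vulnerability counts keyed by lower-case severity."""
    return {severity.lower(): int(count) for severity, count in _COUNT.findall(text)}


def ami_passes(counts: dict, max_critical: int = 0, max_severe: int = 0) -> bool:
    """Hypothetical gate: fail the AMI when critical or severe findings exceed limits."""
    return (counts.get("critical", 0) <= max_critical
            and counts.get("severe", 0) <= max_severe)


ALERT = """SCAN COMPLETED ALERT
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities: 4
Moderate Vulnerabilities: 4
"""

counts = parse_scan_alert(ALERT)
print(counts, ami_passes(counts))
```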

Integration: Control Packaging and Installation
From the RPM spec file of a webserver:
Requires: ossec cloudpassage nflx-base-harden hyperguard-enforcer
Pulls in the following RPMs:
– HIDS agent
– Config assessment/firewall agent
– Host hardening package
– WAF
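
Since the controls arrive as ordinary package dependencies, a build-time check can confirm that a spec file still declares them. The sketch below is a simplified, hypothetical checker. It reads only plain `Requires:` lines and ignores qualified forms such as `Requires(post):`. The package names are the ones shown on the slide.

```python
# Security packages named in the slide's example spec file.
REQUIRED_CONTROLS = frozenset(
    {"ossec", "cloudpassage", "nflx-base-harden", "hyperguard-enforcer"}
)


def missing_controls(spec_text: str) -> set:
    """Return the required control packages that the spec does not declare."""
    declared = set()
    for line in spec_text.splitlines():
        line = line.strip()
        if line.lower().startswith("requires:"):
            deps = line.split(":", 1)[1]
            # Dependencies may be separated by spaces or commas.
            declared.update(deps.replace(",", " ").split())
    return set(REQUIRED_CONTROLS) - declared


spec = "Name: webserver\nRequires: ossec, httpd >= 2.2\n"
print(sorted(missing_controls(spec)))
```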

Integration: Timeline (Chronos)
What IP addresses have been blacklisted by the WAF in the last few weeks?
GET /api/v1/event?timelines=type:blacklist&start=20130125000000000
Which security groups have changed today?
GET /api/v1/event?timelines=type:securitygroup&start=20130206000000000
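
The two queries above differ only in event type and start time, so they can be generated by a small helper. The sketch below rebuilds the request paths exactly as shown on the slide. Reading the 17-digit `start` value as year, month, day, hour, minute, second, and milliseconds is an assumption, and `event_query` is a hypothetical name rather than part of any Chronos client.

```python
from datetime import datetime
from urllib.parse import urlencode


def event_query(event_type: str, start: datetime) -> str:
    """Build the request path for events of one type since `start`."""
    # Assumed timestamp layout: YYYYMMDDHHMMSS followed by milliseconds.
    stamp = start.strftime("%Y%m%d%H%M%S") + f"{start.microsecond // 1000:03d}"
    # Keep ":" unescaped so the query matches the form shown on the slide.
    query = urlencode({"timelines": f"type:{event_type}", "start": stamp}, safe=":")
    return f"/api/v1/event?{query}"


print(event_query("blacklist", datetime(2013, 1, 25)))
print(event_query("securitygroup", datetime(2013, 2, 6)))
```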

Integration: Static Analysis
Available self-service through build environment
– FindBugs, PMD
Jenkins plugin to display graphs and support drill through to results

Integration: Static Analysis

Points of Emphasis
– Integrate
– Make the right way easy
– Self-service, with exceptions
– Trust, but verify
Developers are lazy

Making it Easy: Cryptex
Crypto: DDIY (“Don’t Do It Yourself”)
Many uses of crypto in web/distributed systems:
– Encrypt/decrypt (cookies, data, etc.)
– Sign/verify (URLs, data, etc.)
Netflix also uses crypto heavily for device activation, DRM playback, etc.

Making it Easy: Cryptex
Multi-layer crypto system (HSM basis, scale out layer)
– Easy to use
– Key management handled transparently
– Access control and auditable operations
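
To make the sign/verify use case concrete, here is a minimal sketch using HMAC-SHA256 from the Python standard library. It is not Cryptex's API, which these slides do not describe. In particular, the key here lives in the calling process, whereas the point of a service like Cryptex is that callers never handle key material. The key and URL are made-up demo values.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """Return a hex HMAC-SHA256 tag for the message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, signature: str) -> bool:
    """Check a tag in constant time to avoid leaking timing information."""
    return hmac.compare_digest(sign(key, message), signature)


# Demo values only; a real deployment would not hard-code a key.
key = b"demo-key-not-for-production"
url = b"/play?title=123&expires=1360000000"
tag = sign(key, url)

print(verify(key, url, tag))
print(verify(key, b"/play?title=456&expires=1360000000", tag))
```

Even this small example shows where do-it-yourself crypto goes wrong: the caller has to choose the algorithm, protect the key, and remember to compare tags in constant time.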

Making it Easy: Cloud-Based SSO
In the AWS cloud, access to data center services is problematic
– Examples: AD, LDAP, DNS
But, many cloud-based systems require authN, authZ
– Examples: Dashboards, admin UIs
Asking developers to securely handle/accept credentials is also problematic

Making it Easy: Cloud-Based SSO
Solution: Leverage OneLogin SaaS SSO (SAML) used by IT for enterprise apps (e.g. Workday, Google Apps)
Provides a single & centralized login page
Built base module to make SSO/authN trivial

Points of Emphasis
– Integrate
– Make the right way easy
– Self-service, with exceptions
– Trust, but verify
Self-service is perhaps the most transformative cloud characteristic
Failing to adopt this for security controls will lead to friction

Self-Service: Security Groups
The Asgard cloud orchestration tool allows developers to configure their own firewall rules
Limited to same AWS account, no IP-based rules
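
The two limits above (same AWS account, no IP-based rules) are what separate self-service from an exception that needs review. The sketch below expresses that policy as a predicate. The `IngressRule` type, its fields, and the account IDs are hypothetical and do not reflect Asgard's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IngressRule:
    """A simplified inbound firewall rule (hypothetical shape)."""
    port: int
    source_group: Optional[str] = None    # name of the source security group
    source_account: Optional[str] = None  # AWS account that owns that group
    source_cidr: Optional[str] = None     # IP range, e.g. "10.0.0.0/8"


def allowed_self_service(rule: IngressRule, own_account: str) -> bool:
    """Allow group-to-group rules within one account; anything else is an exception."""
    if rule.source_cidr is not None:
        return False  # IP-based rules are not self-service
    return rule.source_group is not None and rule.source_account == own_account


ok = IngressRule(port=7001, source_group="frontend", source_account="111111111111")
ip_based = IngressRule(port=22, source_cidr="0.0.0.0/0")
print(allowed_self_service(ok, "111111111111"), allowed_self_service(ip_based, "111111111111"))
```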

Points of Emphasis
– Integrate
– Make the right way easy
– Self-service, with exceptions
– Trust, but verify
Culture precludes traditional “command and control” approach
Organizational desire for agile, DevOps, CI/CD blurs traditional security engagement touchpoints

Trust but Verify: Security Monkey
Cloud APIs make verification and analysis of configuration and running state simpler
Security Monkey created as the framework for this analysis
Includes:
– Certificate checking
– Firewall analysis
– IAM entity analysis
– Limit warnings
– Resource policy analysis

Trust but Verify: Security Monkey
From: Security Monkey
Date: Wed, 24 Oct 2012 17:08:18 +0000
To: Security Alerts
Subject: prod Changes Detected
Table of Contents:
Security Groups
Changed Security Group (eu-west-1 / prod)
(eu-west-1 / prod)
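
A "Changes Detected" report like this can be produced by comparing two snapshots of security group rules. The sketch below shows only that comparison step. The snapshot shape (group name mapped to a set of `(protocol, port, source)` tuples) and the group names are assumptions for illustration, not Security Monkey's actual data model.

```python
def changed_groups(previous: dict, current: dict) -> dict:
    """Return added and removed rules for every group that differs between snapshots."""
    changes = {}
    for name in sorted(set(previous) | set(current)):
        before = previous.get(name, set())
        after = current.get(name, set())
        if before != after:
            changes[name] = {
                "added": sorted(after - before),
                "removed": sorted(before - after),
            }
    return changes


# Hypothetical snapshots taken one polling interval apart.
previous = {"web": {("tcp", 80, "elb")}, "db": {("tcp", 3306, "web")}}
current = {"web": {("tcp", 80, "elb"), ("tcp", 22, "bastion")}, "db": {("tcp", 3306, "web")}}

print(changed_groups(previous, current))
```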

Trust but Verify: Exploit Monkey
AWS Autoscaling group is unit of deployment, so changes signal a good time to rerun dynamic scans
On 10/23/12 12:35 PM, Exploit Monkey wrote:
I noticed that testapp-live has changed current ASG name from testapp-live-v001 to testapp-live-v002. I'm starting a vulnerability scan against test app from these private/public IPs: 10.29.24.174
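
The trigger described in that email can be sketched as a comparison between the last-seen and current ASG name for each application. The function name and the dictionary shape below are hypothetical. Treating newly appeared apps as changed is also an assumption.

```python
def apps_to_rescan(previous: dict, current: dict) -> list:
    """Return apps whose current ASG name differs from the last one seen."""
    # An app missing from `previous` is treated as changed (assumed policy).
    return sorted(app for app, asg in current.items() if previous.get(app) != asg)


# Hypothetical state from two consecutive polls of the cloud API.
previous = {"testapp-live": "testapp-live-v001", "other": "other-v010"}
current = {"testapp-live": "testapp-live-v002", "other": "other-v010"}

for app in apps_to_rescan(previous, current):
    print(f"{app} changed ASG from {previous.get(app)} to {current[app]}; starting scan")
```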

Takeaways
Netflix runs a large, dynamic service in AWS
Newer concepts like cloud & DevOps need an updated approach to application security
Specific context can help jumpstart a pragmatic and effective security program
Don’t swim upstream - integrate and collaborate with your engineering partners

Netflix References
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/netflix

Other References
http://www.webpronews.com/netflix-outage-angers-customers-2008-08
http://www.pcmag.com/article2/0,2817,2395372,00.asp
http://www.readwriteweb.com/archives/etech_amazon_cto_aws.php
http://bsimm.com/online/
http://www.microsoft.com/en-us/download/confirmation.aspx?id=29884
http://www.slideshare.net/reed2001/culture-1798664
http://techcrunch.com/2013/01/31/read-what-facebooks-sandberg-calls-maybe-the-most-important-document-ever-to-come-out-of-the-valley/
http://www.gauntlt.org

Questions? chan@netflix.com
