Presentation is loading. Please wait.

Presentation is loading. Please wait.

6/25/2018 1:59 AM BRK3219 Meet the most demanding HPC customer needs on Azure with Cycle Computing and Batch Mark Scurrell Principal Program Manager Jason.

Similar presentations


Presentation on theme: "6/25/2018 1:59 AM BRK3219 Meet the most demanding HPC customer needs on Azure with Cycle Computing and Batch Mark Scurrell Principal Program Manager Jason."— Presentation transcript:

1 6/25/2018 1:59 AM BRK3219 Meet the most demanding HPC customer needs on Azure with Cycle Computing and Batch Mark Scurrell Principal Program Manager Jason Stowe Principal Group Program Manager © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

2 Today we’ll cover Moving big compute to the cloud - customer lessons learnt Azure infrastructure for big compute and HPC workloads Azure Batch for SaaS and new cloud-native apps CycleCloud for clustered, enterprise workloads

3 Big Compute and HPC workloads
Industries Aerospace Automotive Science Energy Architecture Financial Services Manufacturing Media & Entertainment Workloads Test execution Car crash simulations Audio & video transcoding Rendering Data ETL Video pre- & post-processing Genomics Financial risk simulations Deep learning & AI training OCR Compiled MATLAB Engineering stress analysis R at scale Oil reservoir simulation Motorsports simulations Image analysis Computational fluid dynamics

4 #1 – Zero waiting in line for compute

5 #2 – Ask questions of any scale
Ask the right question, regardless of scale Customers use 100s to 1,000s Of cores to answer business-critical Questions they couldn’t have done before.

6 #3 – Users with unique requirements are OK
Trivial to support different use cases Different RAM ratios, GPU, FPGA, Application/OS needs Move workloads that don’t fit internally to Cloud

7 #4 – Cloud gets faster & cheaper over time…

8 #5 – Time and Cost are the sole metrics that matter

9 Everything you don’t think about!

10 #6 – Accelerating answers, accelerates people
720 (hours) 720 Computing Analysis 2880 hours / 120 Days to Decision SCALABLE COMPUTING (in hours) 1456 hours / 60.6 Days to Decision 8 ANTICIPATED BENEFIT (in hours)

11 #6 – Accelerating answers, accelerates people
720 (hours) 720 Computing Analysis 2880 hours / 120 Days to Decision SCALABLE COMPUTING (in hours) Higher Quality Output, Iterative Analysis, Less Context Switching Computing & Analysis POST ADOPTION: AGILE DESIGN PROCESS 8

12 #7 – Every smart person gets their own sandbox
User User User User User User User User User User User User Old: Shared internal cluster New: Cluster Per Researcher Competition for resources Waiting in line for compute Shared downtime Remove bottlenecks Cost controls to manage $ No waiting = 2x faster users 12

13 Infrastructure for big compute and HPC

14 Application types Embarrassingly parallel: Tightly coupled:
Applications do not communicate May share common store & data May have dependencies E.g. Monte Carlo simulations, transcoding, rendering Tightly coupled: Applications communicate; mainly use MPI Requires low latency, high bandwidth networking for scale E.g. car crash simulation, fluid dynamics, AI training

15 Azure compute platform
Parallel R CycleCloud SaaS / Client Solution Batch AI Training VFX Plug-Ins Batch Shipyard Azure Batch VM Management & Job Scheduling Platform VMs & VMSS Infrastructure Hardware

16 No-compromise HPC and AI VMs
Up to 16 cores, 3.2 GHz E V3 Haswell processor Up to 224 GiB DDR4 memory FDR InfiniBand (56 Gbps, 2.6 microsecond latency) 2 TB of local SSD ND Up to 4 NVIDIA Pascal P40 GPUs Up to 24 cores Up to 448 GiB memory Up to 3 TB of local SSD FDR InfiniBand NC Up to 4 NVIDIA Tesla K80 GPUs Up to 24 cores Up to 224 GiB memory Up to 1440 GiB of local SSD FDR InfiniBand NCv2 Up to 4 NVIDIA Pascal P100 GPUs Up to 24 cores Up to 448 GiB memory Up to 3 TB of local SSD FDR InfiniBand NV Up to 4 NVIDIA Tesla M60 GPUs Up to 24 cores Up to 224 GiB memory Up to 1440 GiB of local SSD

17 Why InfiniBand RDMA matters?

18 Why InfiniBand RDMA matters?

19 A Visual Cloud Collaboration Platform for Engineering
6/25/2018 1:59 AM A Visual Cloud Collaboration Platform for Engineering Instant on Collaboration Data management Familiar desktop interface On-demand licensing Finite element analysis Analyze and Optimize Assemblies Best-in-class optimization Motion Analysis and Optimization © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

20 Demo

21 Azure Batch

22 Microsoft Build 2017 6/25/2018 1:59 AM Azure Batch Rendering Media transcoding & pre-/post- processing Test execution Monte Carlo simulations Genomics Deep Learning OCR Data ingestion, processing, ETL R at scale Compiled MATLAB Engineering simulations Image analysis & processing Enable applications and algorithms to easily and efficiently run in parallel at scale © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

23 Applications / Algorithms
6/25/2018 1:59 AM Azure Batch Concepts Queue Jobs & Tasks Applications / Algorithms Pool of VMs INPUT OUTPUT © 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

24 Azure Batch focus areas
Microsoft Build 2017 6/25/2018 1:59 AM Azure Batch focus areas Capacity on demand Jobs on demand Scale according to load Pay by the minute Efficient Elastic 1 to 10,000’s VMs 1 to millions of tasks No charge for Batch; pay for used resources No head node Use low-priority VMs Cost effective Scale © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

25 Elasticity

26 Scale Number of cores

27 Azure Batch capabilities
Access via API’s, CLI’s, and UI’s: .NET, Java, Node.js, Python, REST PowerShell, x-plat Azure CLI Azure Portal, Batch Labs x-plat client UI Choice of VMs: Windows or Linux Standard or custom images Windows pool can use AHUB (new) Use low-priority VMs (now GA) Rich app management: Get apps from blobs, Batch app packages, package managers, custom VM images Docker container images (coming soon) Pool scaling: Manual or automatic VM networking: Pool VMs can be in a VNET Job scheduling: Supports both embarrassingly parallel and tightly coupled MPI jobs Run > 1 task in parallel per node Detect and retry failed tasks Can set max execution time for jobs and tasks Task dependencies Job prep and cleanup tasks Monitoring: VM monitoring and auto-recover Metrics and logs available via Portal and API

28 Demo: Azure Batch

29 Batch Low-priority VMs – now GA
Microsoft Build 2017 6/25/2018 1:59 AM Batch Low-priority VMs – now GA Cheaper compute - up to 80% discount; fixed price Uses surplus capacity - availability can vary; VMs can be preempted All Batch VM sizes and regions Get work done for lower cost, faster, or do more for same price © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

30 Batch low-priority VMs
Microsoft Build 2017 6/25/2018 1:59 AM Batch low-priority VMs Batch pools: Can contain both low-priority and/or dedicated VMs Scale both VM types independently using resize operation or auto-scale If preemption, pool automatically seeks to target to replace preempted VMs Batch tasks: If VMs preempted, interrupted tasks are automatically re-scheduled & re-executed // Create a Pool with mixture of dedicated and low-priority CloudPool pool = PoolOperations.CreatePool (“Pool1”, “Standard_D14_v2”, vmConfig); pool.TargetDedicatedComputeNodes = 20; pool.TargetLowPriorityComputeNodes = 80; pool.Commit(); // Get Pool info CurrentDedicated = pool.CurrentDedicatedComputeNodes; CurrentLowPriority = pool.CurrentLowPriorityComputeNodes; // Resize Pool – no dedicated, all low-pri pool.Resize(0, 100); © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

31 Batch low-priority VMs
Microsoft Build 2017 6/25/2018 1:59 AM Batch low-priority VMs Suitable workload characteristics: Tolerant of interruption Tolerant of reduced capacity Suitable jobs: Job completion time flexibility Shorter tasks © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

32 Batch low-priority flexibility
Microsoft Build 2017 6/25/2018 1:59 AM Lowest cost Batch low-priority flexibility Lower cost, with guaranteed baseline capacity Lowest cost, while maintaining capacity © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

33 https://github.com/Azure/batch-shipyard
Azure Batch Shipyard Drive Batch using Python command line tool and JSON recipes (no development or API usage required) Supports Linux Docker container images Data movement: Azure Files, Azure Blobs, NFS, GlusterFS Create and manage NFS and GlusterFS file systems

34 https://github.com/Azure/doAzureParallel
R - doAzureParallel Scale up R execution using Batch Parallel backend package for popular foreach package Each iteration of foreach loop runs as a Batch task

35 R - doAzureParallel Hiscox is a leading specialist insurer
Flood risk is key growth area in the US Need to be able to run very large-scale catastrophe research Without Azure Batch, the analysis would take 8 months and would limit their research options With Azure Batch, the research can be run in a 48 hour window; allows Hiscox to extend the research performed and remain competitive in the Flood market

36

37 What does Cycle Do? Hybrid Big Compute Orchestration
Measuring Carbon Footprint in Sub-Saharan Africa Fast on-boarding, no changes to the application Leverage internal & Cloud resources Use case could not be done without CycleCloud 1 3 2 Local Cloud 4

38 The Challenges: Cloud & HPC Big Compute
Existing Workflows Applications Budget Controls User Inputs Security Stack Authorization Data Dependencies Scheduler Scalability Job scripts & data Cloud accounts Storage / Data sources OS variations AD / LDAP Authorization IT LOB Inputs

39 The Solution: CycleCloud
Existing Workflows Applications Budget Controls User Inputs Security Stack Authorization Data Dependencies Scheduler Scalability Job scripts & data Cloud accounts Storage / Data sources OS variations AD / LDAP Authorization Internal Audit/Compliance data Usage data (User, Group, App) Job run-time by instance data AppServer platform IT LOB Inputs

40 Who is Cycle Computing? Leader in Big Compute Cloud Management Software Pioneering Cloud Big Compute/HPC for 10 years Compute hour growth: 7x every 2 years Managing 380M hours in 2016 CycleCloud Value Proposition Simple Managed Access to Big Compute Accelerating Innovation for the Enterprise Faster time to result, with cost control Our customers Fortune 500, startups, and public sector Big 5 Life sciences & pharma, Big 5 financial services, Fortune 100 Insurance, Fortune 100 manufacturing, aerospace & electronics

41 Cycle Unleashes Productivity…
Existing workflows, existing schedulers Instant access to resources Access scale up/down, with control Reliable tools for IT Workflows linking internal and external clouds Tools to manage & control costs Secure, consistent cloud access Ability to link usage to costs Support for Active Directory authorization

42 Cloud BigCompute Platform: Visibility & Reporting
Compute Instances: Performance & cost Data utilization: Multiple Endpoints

43 Transform Business Process…
Fortune 100 Financial Internal 2,500 core cluster CycleCloud Secure 8,000 core Cluster 100,000 hr. jobs, now done in a day on cloud Regulatory workload done 5x faster at 20% the cost

44 Transform Product Development…
Applications QA team #1 Design team #2 Design team #1 Validation team #1 Manage many groups with many users running many applications Increased compute resources 10X 200% faster time to market

45 “Transform Thinking”…
The Broad Institute Internal Cluster Cloud On 8,000 cores in 1.5 hours 2 weeks, from concept to completion Accessing 50,000 cores 30 years of compute in 6 hours

46 A Predictable Process to Production…
Plan Pilot Production Promote Identify workflow & path to Cloud resources Test workflow, Clear success criteria Define controls Finalize security plan, Review results, Move to production Leverage tools to expand user base, new applications

47 Demo

48 CycleCloud Summary Based on Azure infrastructure
Can be used on-premises and on Azure VMs Focus areas: Hybrid on-premises and cloud Run existing HPC environments in the cloud End-to-end solution for IT Governance and management; e.g. core and $$$ quotas If interested, send to

49 Reference Other MS Ignite sessions: More information:
BRK2275: Building a Hollywood blockbuster using Azure Batch rendering BRK4034: Working with models for machine learning and Azure Batch AI BRK2151: Past, present, and future: GPU and AI infrastructure on Microsoft Azure More information: Azure Batch documentation: Azure Hybrid Use Benefit:

50 Please evaluate this session
Tech Ready 15 6/25/2018 Please evaluate this session From your Please expand notes window at bottom of slide and read. Then Delete this text box. PC or tablet: visit MyIgnite Phone: download and use the Microsoft Ignite mobile app Your input is important! © 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

51 6/25/2018 1:59 AM © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.


Download ppt "6/25/2018 1:59 AM BRK3219 Meet the most demanding HPC customer needs on Azure with Cycle Computing and Batch Mark Scurrell Principal Program Manager Jason."

Similar presentations


Ads by Google