Cloud Computing for Research Roger S Barga Dennis Gannon, Jared Jackson, Wei Lu, Jaliya Ekanayake, Mohamed Fathalla Extreme Computing Group, MSR.

1 Cloud Computing for Research Roger S Barga Dennis Gannon, Jared Jackson, Wei Lu, Jaliya Ekanayake, Mohamed Fathalla Extreme Computing Group, MSR

2 Three areas of focus for our team
Highly Scalable Research Services
  Target high-value research applications that currently impede progress.
  Release as open source to the research community.
Lower barriers between clouds and research (impedance mismatch)
  Do I really need to rewrite/port my application?
  Do I really need to know that I am even using a cloud? (client + cloud)
Data Analytics as a Service
  Services for data scientists to explore extremely large data sets.
  Raise the level of abstraction for deploying and using analytics.
Provide technical support to NSF Computing in Cloud PIs (groups), part of an international program (US, Europe, Asia, …).

3 Computational Resources in Research: Lack of Broad Access
[Figure: scientists and engineers versus high-performance, data-intensive capacity; chart labels 70M, 55M, 14M, 1M, 80%, 20%]
Little to no access to high-performance data-intensive capacity.

4 Data Explosion in Bioinformatics & Life Sciences
Biological Engineering; Genomics; Environmental Engineering; Oceanography, Climate Research.
[Figure: NCBI Trace Library]

5 NCBI BLAST
BLAST (Basic Local Alignment Search Tool) is one of the most important software tools in bioinformatics.
It identifies similarity between bio-sequences.
It is computationally intensive:
  A large number of pairwise alignment operations.
  A typical BLAST run can take 700 to 1,000 CPU hours.
Most biologists have two choices for running large jobs:
  Build a local cluster.
  Submit jobs to NCBI or EBI (long job queue times).

6 NCBI BLAST on Windows Azure
Parallel BLAST engine on Azure.
Query-segmentation, data-parallel pattern:
  Split the input sequences.
  Query the partitions in parallel.
  Merge the results together when done.
Follows the generally suggested application model for Windows Azure: Web Role + Queue + Worker.
Three special considerations:
  Batch job management.
  Task parallelism on an elastic cloud.
  Large data-set management.

7 AzureBLAST Task-Flow
A simple split/join pattern: a splitting task fans out to many BLAST tasks, whose results feed a merging task.
Leverage the multiple cores of one instance through the "-num_threads" argument of NCBI-BLAST.
Task granularity:
  Large partitions cause load imbalance.
  Small partitions cause unnecessary overheads (NCBI-BLAST overhead, data-transfer overhead).
Best practice: use test runs to profile and determine the optimal partition size.
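
The split/join task flow on this slide can be sketched in a few lines of Python. This is only an illustration of the pattern, not the AzureBLAST code: `split_queries`, `run_split_join`, and the stand-in `fake_blast` task are hypothetical names, and a local thread pool stands in for the Azure queue and worker roles.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_queries(sequences: List[str], partition_size: int) -> List[List[str]]:
    """Splitting task: cut the input sequences into fixed-size partitions."""
    if partition_size < 1:
        raise ValueError("partition_size must be at least 1")
    return [sequences[i:i + partition_size]
            for i in range(0, len(sequences), partition_size)]


def run_split_join(sequences: List[str],
                   partition_size: int,
                   blast_task: Callable[[List[str]], List[str]],
                   max_workers: int = 4) -> List[str]:
    """Run one task per partition in parallel, then merge the results."""
    partitions = split_queries(sequences, partition_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in partition order, so the merge is a
        # simple concatenation.
        partial_results = list(pool.map(blast_task, partitions))
    return [hit for partial in partial_results for hit in partial]


def fake_blast(partition: List[str]) -> List[str]:
    """Hypothetical stand-in for a BLAST run over one partition."""
    return [f"{sequence}:hit" for sequence in partition]
```

Calling `run_split_join(["s1", "s2", "s3"], 2, fake_blast)` runs two partitions and returns the merged hits in input order. The `partition_size` argument is the granularity knob the slide warns about.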

8 Micro-Benchmarks Inform Design
Task size vs. performance:
  Benefit of the warm-cache effect.
  100 sequences per partition was the best choice.
Instance size vs. performance:
  Super-linear speedup with larger worker instances, primarily due to their memory capacity.
Task size / instance size vs. cost:
  Extra-large instances gave the best and most economical throughput.
  Fully utilize the resource.
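
A micro-benchmark of this kind can be as simple as timing the same workload at several partition sizes and keeping the size with the highest throughput. The sketch below assumes a caller-supplied `run_partition` function and an injectable `clock`. Both names are hypothetical, and nothing here reproduces the measurements reported on the slide.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple


def measure_throughput(run_partition: Callable[[Sequence], None],
                       sequences: Sequence,
                       partition_size: int,
                       clock: Callable[[], float] = time.perf_counter) -> float:
    """Return sequences processed per second for one partition size."""
    start = clock()
    for i in range(0, len(sequences), partition_size):
        run_partition(sequences[i:i + partition_size])
    elapsed = clock() - start
    return len(sequences) / elapsed if elapsed > 0 else float("inf")


def pick_partition_size(run_partition: Callable[[Sequence], None],
                        sequences: Sequence,
                        candidate_sizes: List[int],
                        clock: Callable[[], float] = time.perf_counter
                        ) -> Tuple[int, Dict[int, float]]:
    """Profile every candidate size and return the best one with all results."""
    results = {size: measure_throughput(run_partition, sequences, size, clock)
               for size in candidate_sizes}
    best = max(results, key=results.get)
    return best, results
```

Passing the clock in makes the harness easy to test with a simulated timer. On real hardware the default `time.perf_counter` is used, and each measurement should be repeated, since warm caches change the result.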

9 R. palustris as a platform for H2 production
Identify key drivers for producing hydrogen, a promising alternative fuel: understand R. palustris well enough to be able to improve its H2 production.
Characterize a population of strains and use integrative genomics approaches to dissect the molecular networks of H2 production.
BLAST to query 16 strains to sort out genetic relationships:
  Each strain has an estimated ~5,000 proteins.
  Jobs were kicked off NCBI clusters before completion.
  Against NCBI non-redundant proteins: ~30 min.
  Against ~5,000 proteins from another strain: < 30 sec.
Publishable result in one day for roughly $150.
Eric Schadt (Pac Bio) and Sam Phattarasukol (Harwood Lab, UW)

10 All-Against-All Experiment: Discovering Homologs
BLAST against UniRef100, a non-redundant protein sequence database, to discover the interrelationships of known protein sequences.
"All against all" query: the database is also the input query.
  The protein database is large (4.2 GB).
  A total of 9,865,668 sequences to be queried.
  Theoretically, about 9.7 × 10^13 (roughly 100 trillion) pairwise sequence comparisons, that is, 9,865,668 squared.
Performance estimation:
  Estimated completion time of 3,216,731 minutes (6.1 years) on an 8-core VM.
One of the largest BLAST jobs as far as we know.
An experiment of this scale is usually infeasible for most researchers.
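
The figures on this slide are easy to sanity-check. The short sketch below converts the estimated minutes to years and counts the pairs in a naive all-against-all query. It assumes every ordered pair of sequences counts as one comparison, which is a simplification, since BLAST does not literally align every pair.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # ignoring leap years


def minutes_to_years(minutes: float) -> float:
    """Convert a duration in minutes to (non-leap) years."""
    return minutes / MINUTES_PER_YEAR


def all_against_all_pairs(num_sequences: int) -> int:
    """Naive pair count when the database is also the query set."""
    return num_sequences * num_sequences


ESTIMATED_MINUTES = 3_216_731  # from the slide
NUM_SEQUENCES = 9_865_668      # from the slide

years = minutes_to_years(ESTIMATED_MINUTES)   # about 6.1 years
pairs = all_against_all_pairs(NUM_SEQUENCES)  # about 9.7e13 pairs
```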

11 Our Approach
Allocated a total of ~4,000 cores: 475 extra-large VMs (8 cores per VM).
8 deployments of AzureBLAST, each with its own co-located storage service.
Divided the 10 million sequences into multiple segments; each segment was submitted to one deployment as one job for execution.
300,000 tasks on ~4,000 cores on Azure (70,000 bp or 35 sequences per task).
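
The sizing on this slide follows from simple division. The sketch below is a rough planning calculation, not the team's actual tooling: `plan_job` is a hypothetical helper, and spreading tasks evenly across deployments is an assumption, since the slide only says the sequences were divided into multiple segments.

```python
import math
from typing import List, Tuple


def plan_job(total_sequences: int,
             sequences_per_task: int,
             deployments: int) -> Tuple[int, List[int]]:
    """Return the total task count and an even split of tasks per deployment."""
    tasks = math.ceil(total_sequences / sequences_per_task)
    base, extra = divmod(tasks, deployments)
    # The first `extra` deployments take one additional task each.
    per_deployment = [base + (1 if i < extra else 0) for i in range(deployments)]
    return tasks, per_deployment


total_cores = 475 * 8  # 3,800 cores, which the slide rounds to ~4,000
tasks, per_deployment = plan_job(9_865_668, 35, 8)
```

With 35 sequences per task the plan comes to 281,877 tasks, which is consistent with the roughly 300,000 tasks quoted above.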

12 Cloud System Upgrades
North Europe data center: 34,256 tasks processed in total.
All 62 nodes lost tasks and then came back together.
This is an update domain: about 30 minutes, about 6 nodes in one group.

13 Failures Happen
West Europe data center: 30,976 tasks were completed, and the job was killed.
35 nodes experienced blob-writing failures at the same time.
Reasonable guess: the fault domain is working.

14 Impedance mismatch: Azure is designed to manage long-running services in a highly available, cost-effective manner. Researchers operate quite differently.
Business: “develop, deploy and forget”
Researcher: “constantly changing codebase, tasks, dependencies”
Anthill: making Azure easier to use for researchers
  > AHill myCalc.exe              myCalc will run on Azure
  > AHill myCalc.exe d1 d2 d3…    parameter sweep on Azure
  > AHill myCalc1 …               concurrent execution using a VM pool
  > AHill myCalc2 …
  > AHill myCalc3 …
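
The slide shows only Anthill's command lines, not its programming interface, so the following is a local analogue of the parameter-sweep idea rather than Anthill itself. The `parameter_sweep` function is a hypothetical name, and a thread pool stands in for the pool of Azure VMs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Hashable, Sequence, TypeVar

P = TypeVar("P", bound=Hashable)
R = TypeVar("R")


def parameter_sweep(calc: Callable[[P], R],
                    parameters: Sequence[P],
                    pool_size: int = 4) -> Dict[P, R]:
    """Run `calc` once per parameter on a worker pool and collect the results."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # pool.map keeps results aligned with the order of `parameters`.
        return dict(zip(parameters, pool.map(calc, parameters)))
```

For example, `parameter_sweep(lambda d: d * d, [1, 2, 3])` evaluates the same calculation for each of the three inputs, loosely mirroring `AHill myCalc.exe d1 d2 d3…`.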

15 Anthill: making Azure easier to use for researchers
Completed:
  Support for application parametric sweeps (various patterns).
  Support for complex data types (any ISerializable type).
  Support for scheduler fault tolerance (no single point of failure).
Ongoing:
  Complex schedules (workflows), in progress.
  Prepare for an open release.

16 Anthill: making Azure easier to use for researchers
Lessons learned:
  A master scheduling work into a pool of workers is highly efficient.
  A lightweight workflow is enough to coordinate the task flow.
  Fault tolerance and data movement between tasks: don't always write results to long-term storage; wait to see whether future tasks reuse them.
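
As a rough illustration of the master/worker lesson, and of why fault tolerance matters when nodes drop tasks, the sketch below has a master resubmit failed tasks to a worker pool. It is a minimal local sketch under stated assumptions, not Anthill's scheduler: `run_with_retries` and its parameters are hypothetical, and results are simply held in memory until the job finishes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict


def run_with_retries(tasks: Dict[str, Any],
                     worker: Callable[[Any], Any],
                     pool_size: int = 4,
                     max_attempts: int = 3) -> Dict[str, Any]:
    """Master loop: dispatch tasks to a pool and resubmit the ones that fail."""
    results: Dict[str, Any] = {}
    attempts = {task_id: 0 for task_id in tasks}
    pending = dict(tasks)
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        while pending:
            futures = {pool.submit(worker, payload): task_id
                       for task_id, payload in pending.items()}
            failed: Dict[str, Any] = {}
            for future in as_completed(futures):
                task_id = futures[future]
                attempts[task_id] += 1
                try:
                    # Results stay in memory so later tasks could reuse them.
                    results[task_id] = future.result()
                except Exception:
                    if attempts[task_id] >= max_attempts:
                        raise  # give up after too many failures
                    failed[task_id] = pending[task_id]
            pending = failed  # only the failed tasks go around again
    return results
```

A task that fails once, for instance because its node was recycled mid-run, is picked up again on the next pass without restarting the tasks that already succeeded.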

17 Excel DataScope
Offer data analytics as a service on Windows Azure that enables users to upload data and extract patterns from it, identify hidden associations, discover similarities, forecast time series, and more.
The project includes an extensible collection of data analytics and machine learning algorithms and a runtime service on Azure that scales out the execution of these algorithms.
Analysts can submit, sample, and analyze data from Excel through a customizable data analytics ribbon.

18 Excel DataScope
So what are we building?
A common framework for implementing analytics and machine learning algorithms that can efficiently scale out to handle jobs of varying size:
  A highly efficient MapReduce framework, from batch to streaming/iteration.
  In-memory processing algorithms whenever possible.
  Minimal I/O overhead.
  Incremental processing.
Efficient job scheduling of MapReduce tasks across a shared pool of Azure VMs.
Data services, from partitioning data across VMs to shared read-only working sets.
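
For readers unfamiliar with the pattern, an in-memory MapReduce step can be sketched as below. This is a single-process illustration only. It is not the DataScope framework, and `map_reduce`, `mapper`, and `reducer` are hypothetical names.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, Iterable, List, Tuple


def map_reduce(records: Iterable[Any],
               mapper: Callable[[Any], Iterable[Tuple[Hashable, Any]]],
               reducer: Callable[[Hashable, List[Any]], Any]) -> Dict[Hashable, Any]:
    """Map each record to (key, value) pairs, group by key, reduce each group."""
    groups: Dict[Hashable, List[Any]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle step, kept entirely in memory
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(lines: Iterable[str]) -> Dict[Hashable, Any]:
    """Example job: word count over a few lines of text."""
    return map_reduce(lines,
                      mapper=lambda line: [(word, 1) for word in line.split()],
                      reducer=lambda word, ones: sum(ones))
```

Keeping the grouped values in memory avoids writing intermediate files, which is the I/O saving the slide points to. A scaled-out version would partition `records` across VMs and merge the per-partition groups.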

20 Observations and Experience
Clouds are the largest-scale compute centers ever constructed and have the potential to be important to both large- and small-scale research.
Equally important, they can increase participation in research, providing much-needed resources to users and communities that lack ready access.
There is an impedance mismatch between clouds and research workloads.
Clouds provide valuable fault tolerance and scalability abstractions.
Select the best-fit VM for the job (CPU / memory / network).
Guidance, recommendations, and examples are just hints.
Always measure if in doubt.

21 Resources: AzureScope
Simple benchmarks illustrating basic performance for compute and storage services.
Benchmarks for reference algorithms.
Best-practice tips.
Code samples.
[…] us with questions at […]

23 © 2010 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

