IEEE e-Science 2010 Conference DECEMBER 2010


1 IEEE e-Science 2010 Conference 7 - 10 DECEMBER 2010
Windows Azure for Research Roger Barga, Architect Cloud Computing Futures, MSR

2 The Million Server Datacenter

3 HPC and Clouds – Select Comparisons
Node and system architectures Communication fabric Storage systems and analytics Physical plant and operations Programming models (rest of tutorial)

4 HPC Node Architecture Moore’s “Law” favored commodity systems
Specialized processors and systems faltered; "killer micros" and industry-standard blades led; inexpensive clusters now dominate. If you look at what most cloud providers are deploying in their data centers and compare it with the HPC systems used in technical computing, the node architectures are indistinguishable: Intel Nehalem or AMD Barcelona/Shanghai parts, multiple processors, and a big chunk of memory per node. If you look at what has happened for one particular benchmark, the linear systems solver, over the past 15 or so years, the only thing left on the market is x86-based systems; pretty much everything else has been killed off. The only exception is custom hardware with custom interconnects, which to a first approximation means the IBM Blue Gene. Economics is the driver.

5 HPC Interconnects Ethernet for low end (cost sensitive)
High-end expectations: {nearly} flat networks and very large switches, with operating system bypass for low latency (microseconds). The single biggest difference between a data center and an HPC system is the interconnect. Higher-end systems use an InfiniBand switching fabric, currently at 40 Gb/s. All the specialized high-end interconnects have gone out of business; it remains to be seen whether this technology gets overrun by 40 Gb Ethernet.

6 Modern Data Center Network
Our standard data center interconnect looks something like the following: top-of-rack switches serve the local clusters at GigE, because that is the commodity speed on the board; we then step up to 10 GigE and build VLANs across that hierarchy. Key: CR = L3 border router, AR = L3 access router, S = L2 switch, LB = load balancer, A = 20-server rack/TOR; GigE within racks, 10 GigE above, Layer 3 toward the Internet and Layer 2 within the data center.

7 HPC Storage Systems Local disk Secondary storage Tertiary storage
Local disk: scratch or non-existent. Secondary storage: SAN and parallel file systems, hundreds of TBs at most. Tertiary storage: tape robot(s), 3-5 GB/s bandwidth, ~60 PB capacity. Another difference is storage: there isn't much. This is data from LBL, one of the largest sites for technical computing in the world: 60 PB of tape storage, but only a couple hundred terabytes of spinning disk for secondary storage.

8 HPC and Clouds – Select Comparisons
Node and system architectures Communication fabric Storage systems and analytics Physical plant and operations Programming models (rest of tutorial)

9 A Tour Around Windows Azure

10 Azure in Action, Manning Press
Programming Windows Azure, O’Reilly Press Bing: Channel 9 Windows Azure Bing: Windows Azure Platform Training Kit – November 2010 Update

11 Application Model Comparison
Ad Hoc Application Model Machines Running IIS / ASP.NET Machines Running Windows Services Machines Running SQL Server

12 Application Model Comparison
Machines Running IIS / ASP.NET Windows Services SQL Server Ad Hoc Application Model Web Role Instances Worker Role Instances Azure Storage Blob / Queue / Table SQL Azure Windows Azure Application Model

13 Key Components Fabric Controller Compute Storage
Fabric Controller: manages hardware and virtual machines for the service. Compute: Web Roles (web application front end), Worker Roles (utility compute), VM Roles (custom compute role; you own and customize the VM). Storage: Blobs (binary objects), Tables (entity storage), Queues (role coordination), SQL Azure (SQL in the cloud).

14 Key Components Fabric Controller
Think of it as an automated IT department “Cloud Layer” on top of: Windows Server 2008 A custom version of Hyper-V called the Windows Azure Hypervisor Allows for automated management of virtual machines

15 Key Components Fabric Controller
Think of it as an automated IT department: a "cloud layer" on top of Windows Server 2008 and a custom version of Hyper-V called the Windows Azure Hypervisor, allowing automated management of virtual machines. Its job is to provision, deploy, monitor, and maintain applications in data centers. Applications have a "shape" and a "configuration". The configuration definition describes the shape of a service: role types, role VM sizes, external and internal endpoints, local storage. The configuration settings configure a service: instance count, storage keys, application-specific settings.

16 Key Components Fabric Controller
Manages "nodes" and "edges" in the "fabric" (the hardware): power-on automation devices, routers/switches, hardware load balancers, physical servers, virtual servers. State transitions: current state vs. goal state; it does what is needed to reach and maintain the goal state. It's the perfect IT employee! It never sleeps, never asks for a raise, and always does what you tell it to do in the configuration definition and settings.

17 Creating a New Project

18 Windows Azure Compute

19 Key Components – Compute Web Roles
Web Front End Cloud web server Web pages Web services You can create the following types: ASP.NET web roles ASP.NET MVC 2 web roles WCF service web roles Worker roles CGI-based web roles

20 Key Components – Compute Worker Roles
Utility compute Windows Server 2008 Background processing Each role can define an amount of local storage. Protected space on the local drive, considered volatile storage. May communicate with outside services Azure Storage SQL Azure Other Web services Can expose external and internal endpoints

21 Suggested Application Model Using queues for reliable messaging

22 Scalable, Fault Tolerant Applications
Queues are the application glue Decouple parts of application, easier to scale independently; Resource allocation, different priority queues and backend servers Mask faults in worker roles (reliable messaging).
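The "queues as glue" idea above can be sketched with Python's standard-library queue standing in for an Azure queue. The function names (web_role_enqueue, worker_role_drain) and blob names are illustrative, not part of any Azure API:

```python
# Sketch of decoupling a web role from worker roles via a queue.
# queue.Queue stands in for an Azure Queue; names are invented.
import queue

work_queue = queue.Queue()

def web_role_enqueue(blob_name):
    # The web role stores the payload in blob storage and enqueues a small
    # "work ticket" that merely references it.
    work_queue.put({"ticket": blob_name})

def worker_role_drain():
    processed = []
    while not work_queue.empty():
        msg = work_queue.get()
        processed.append(msg["ticket"])  # real code would fetch the blob here
        work_queue.task_done()
    return processed

web_role_enqueue("uploads/video1.mpg")
web_role_enqueue("uploads/video2.mpg")
print(worker_role_drain())  # ['uploads/video1.mpg', 'uploads/video2.mpg']
```

Because the two sides share only the queue, either side can be scaled (or can fail and restart) independently.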

23 Key Components – Compute VM Roles
Customized role: you own the box. How it works: download the "Guest OS" image to a Server 2008 Hyper-V machine, customize the OS as you need to, then upload the differences as a VHD. Azure runs your VM role using the base OS plus your differences VHD.

24 Application Hosting

25 ‘Grokking’ the service model
Imagine white-boarding out your service architecture with boxes for nodes and arrows describing how they communicate The service model is the same diagram written down in a declarative format You give the Fabric the service model and the binaries that go with each of those nodes The Fabric can provision, deploy and manage that diagram for you Find hardware home Copy and launch your app binaries Monitor your app and the hardware In case of failure, take action. Perhaps even relocate your app At all times, the ‘diagram’ stays whole

26 Automated Service Management
Provide code + service model Platform identifies and allocates resources, deploys the service, manages service health Configuration is handled by two files ServiceDefinition.csdef ServiceConfiguration.cscfg

27 Service Definition

28 Service Configuration

29 GUI Double click on Role Name in Azure Project

30 Deploying to the cloud We can deploy from the portal or from script
VS builds two files: an encrypted package of your code, and your config file. You must create an Azure account, then a service, and then you deploy your code. Deployment can take up to 20 minutes (which is better than six months).

31 Service Management API
REST-based API to manage your services, with X.509 certificates for authentication. Lets you create, delete, change, upgrade, swap, and more. Lots of community and MSFT-built tools exist around the API, and it's easy to roll your own.
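A minimal sketch of what a call against this REST API looked like. The host name and the x-ms-version header value shown are assumptions from the 2010-era Service Management API; the subscription ID is a placeholder. The sketch only constructs the request; a real client would attach the management certificate and send it over HTTPS:

```python
# Hedged sketch: build the URL and headers for a Service Management GET.
# "management.core.windows.net" and the version string are assumptions
# from the 2010-era API, not verified against the current service.
def management_request(subscription_id, resource, api_version="2010-10-28"):
    url = (f"https://management.core.windows.net/"
           f"{subscription_id}/services/{resource}")
    headers = {"x-ms-version": api_version}
    return url, headers

url, headers = management_request("your-subscription-guid", "hostedservices")
# A real client would now send this request with the X.509 management
# certificate attached (e.g., via http.client with an SSL context).
print(url)
```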

32 The Secret Sauce – The Fabric
The Fabric is the ‘brain’ behind Windows Azure. Process service model Determine resource requirements Create role images Allocate resources Prepare nodes Place role images on nodes Configure settings Start roles Configure load balancers Maintain service health If role fails, restart the role, based on policy If node fails, migrate the role, based on policy

33 Storage Replicated, Highly Available, Load Balanced

34 Durable Storage, At Massive Scale
Blob - Massive files e.g. videos, logs Drive - Use standard file system APIs Tables - Non-relational, but with few scale limits - Use SQL Azure for relational data Queues - Facilitate loosely-coupled, reliable, systems Durable Storage, At Massive Scale

35 Blob Features and Functions
Store Large Objects (up to 1TB in size) You can have as many containers and Blobs as you want Standard REST Interface PutBlob Inserts a new blob, overwrites the existing blob GetBlob Get whole blob or a specific range DeleteBlob CopyBlob SnapshotBlob LeaseBlob Each Blob has an address

36 Containers Similar to a top level folder Has an unlimited capacity
Can only contain BLOBs Each container has an access level: Private Default, will require the account key to access Full public read Public read only

37 Two Types of Blobs Under the Hood
Block Blob Targeted at streaming workloads Each blob consists of a sequence of blocks Each block is identified by a Block ID Size limit 200GB per blob Page Blob Targeted at random read/write workloads Each blob consists of an array of pages Each page is identified by its offset from the start of the blob Size limit 1TB per blob

38 Blocks You can upload a file in ‘blocks’. Each block has an id.
Then commit those blocks in any order into a blob. The final blob is limited to 1 TB and up to 50,000 blocks. You can modify a blob by inserting, updating, and removing blocks. Blocks live for a week before being GC'd if not committed to a blob. Optimized for streaming.
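The upload-then-commit flow can be sketched locally: split the data into blocks, assign each a block ID, then assemble the blob from an ordered block list. This is a simulation of the idea, not the storage client library; the 4 MB per-block figure is an assumption about the service limit of that era:

```python
# Local simulation of block-blob upload: split, assign IDs, commit in order.
import base64

BLOCK_SIZE = 4  # tiny for illustration; real blocks could be up to ~4 MB

def split_into_blocks(data, block_size=BLOCK_SIZE):
    blocks, order = {}, []
    for i in range(0, len(data), block_size):
        block_id = base64.b64encode(f"block-{i:08d}".encode()).decode()
        blocks[block_id] = data[i:i + block_size]
        order.append(block_id)
    return blocks, order

def commit_block_list(blocks, order):
    # The service assembles the final blob from the committed ID order;
    # uncommitted blocks are garbage-collected after about a week.
    return b"".join(blocks[bid] for bid in order)

blocks, order = split_into_blocks(b"Big.mpg content here")
assert commit_block_list(blocks, order) == b"Big.mpg content here"
# Committing the same blocks in a different order yields a different blob:
reordered = commit_block_list(blocks, list(reversed(order)))
```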

39 Pages Similar to block blobs.
Similar to block blobs, but optimized for random read/write operations, providing the ability to write to a range of bytes in a blob. Call Put Blob to set the maximum size, then call Put Page. All pages must align to 512-byte page boundaries. Writes to page blobs happen in place and are immediately committed to the blob. The maximum size for a page blob is 1 TB.
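The 512-byte alignment rule can be illustrated with a small helper that validates a write offset and pads the payload out to the next page boundary. The helper name and padding policy are illustrative, not part of any Azure API:

```python
# Sketch of the page-blob rule: writes must start and end on 512-byte
# page boundaries. aligned_range is an invented helper for illustration.
PAGE = 512

def aligned_range(offset, payload):
    if offset % PAGE != 0:
        raise ValueError("write offset must be 512-byte aligned")
    padded = payload + b"\x00" * (-len(payload) % PAGE)  # pad to boundary
    return offset, offset + len(padded) - 1, padded

start, end, padded = aligned_range(1024, b"hello")
assert (start, end, len(padded)) == (1024, 1535, 512)
```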

40 BLOB Leases Creates a 1 minute exclusive write lock on a BLOB.
Operations: Acquire, Renew, Release, Break. Must have the lease id to perform operations. Can check LeaseStatus property. Currently can only be done through REST.

41 Windows Azure Drive Provides a durable NTFS volume for Windows Azure applications to use: use existing NTFS APIs to access a durable drive, with durability and survival of data on application failover. Enables migrating existing NTFS applications to the cloud. A Windows Azure Drive is a Page Blob: for example, mount a Page Blob as X:\. All writes to the drive are made durable to the Page Blob, and the drive is made durable through standard Page Blob replication. The drive persists even when not mounted, since it is just a Page Blob.

42 Windows Azure Drive API
Create Drive - Creates a Page Blob formatted as a single partition NTFS volume VHD. Initialize Cache – Allows an application to specify the location and size of the local data cache for all Windows Azure Drives mounted for that VM instance. Mount Drive – Takes a formatted Page Blob and mounts it to a drive letter for the Windows Azure application to start using. Get Mounted Drives – Returns the list of mounted drives. It consists of a list of the drive letter and Page Blob URLs for each mounted drive. Unmount Drive – Unmounts the drive and frees up the drive letter. Snapshot Drive – Allows the client application to create a backup of the drive (Page Blob). Copy Drive – Provides the ability to copy a drive or snapshot to another drive (Page Blob) name to be used as a read/writable drive.

43 BLOB Guidance Manage connection strings/keys in cscfg
Do not share keys; wrap access with a service. Have a strategy for accounts and containers. You can assign a custom domain to your storage account. There is no method to detect container existence: call FetchAttributes() and detect the error if it doesn't exist.

44 Table Structure Tables store entities.
Tables store entities, and entity schema can vary within the same table. Example: an account "MovieData" contains a table "Movies" with entities Star Wars, Star Trek, and Fan Boys, and a table "Customers" with entities Brian H. Prince, Jason Argonaut, and Bill Gates.

45 Windows Azure Tables Provides Structured Storage
Massively Scalable Tables Billions of entities (rows) and TBs of data Can use thousands of servers as traffic grows Highly Available & Durable Data is replicated several times Familiar and Easy to use API WCF Data Services and OData .NET classes and LINQ REST – with any platform or language

46 Is Not Relational
Cannot: create foreign key relationships between tables; perform server-side joins between tables; create custom indexes on the tables. There is no server-side Count(), for example. All entities must have the following properties: Timestamp, PartitionKey, RowKey.

47 Windows Azure Queues Queues are performance efficient, highly available, and provide reliable message delivery. Simple, asynchronous work dispatch. Programming semantics ensure that a message can be processed at least once. Access is provided via REST. © 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

48 Storage Partitioning Understanding partitioning is key to understanding performance. Every data object has a partition key, different for each data type (blobs, entities, queues). The partition key is the unit of scale: a partition can be served by a single server, and the system load balances partitions based on traffic pattern. It also controls entity locality. Load balancing can take a few minutes to kick in, and it can take a couple of seconds for a partition to become available on a different server. Use exponential backoff on "Server Busy": either the system is load balancing to meet your traffic needs, or the limits of a single partition have been reached.

49 Partition Keys In Each Abstraction
Blobs – container name + blob name; every blob and its snapshots are in a single partition (e.g., container "image" with blobs annarbor/bighouse.jpg and foxborough/gillette.jpg; container "video"). Entities – TableName + PartitionKey; entities with the same PartitionKey value are served from the same partition (e.g., PartitionKey CustomerId=1 holds both the customer row "John Smith", with its credit card number, and his Order row with OrderTotal $35.12). Messages – queue name; all messages for a single queue belong to the same partition (e.g., queues "jobs" and "workflow").

50 Replication Guarantee
All Azure Storage data exists in three replicas, created as needed. A write operation is not complete until it has been written to all three replicas, and reads are only load balanced to replicas that are in sync. (Diagram: partitions P1..Pn each replicated across Servers 1-3.)

51 Scalability Targets
Storage account: capacity up to 100 TBs; transactions up to a few thousand requests per second; bandwidth up to a few hundred megabytes per second. Single blob partition: throughput up to 60 MB/s. Single queue/table partition: up to 500 transactions per second. To go above these numbers, partition between multiple storage accounts and partitions. When a limit is hit, the app will see "503 Server Busy"; applications should implement exponential backoff.
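The recommended exponential backoff on "503 Server Busy" can be sketched as a small retry wrapper. The function names and delay values are illustrative; RuntimeError stands in for a 503 response:

```python
# Sketch of exponential backoff with jitter for "503 Server Busy" retries.
import random
import time

def with_backoff(operation, max_attempts=5, base_delay=0.5):
    for attempt in range(max_attempts):
        try:
            return operation()
        except RuntimeError:                 # stands in for a 503 response
            if attempt == max_attempts - 1:
                raise                        # give up after the last attempt
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(min(delay, 30))       # cap the wait

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("503 Server Busy")
    return "ok"

assert with_backoff(flaky, base_delay=0.001) == "ok"
assert calls["n"] == 3  # succeeded on the third attempt
```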

52 Partitions and Partition Ranges
Example: Table = Movies, with PartitionKey (Category) and RowKey (Title). Initially Server A serves the whole range [Min - Max]: Action (Fast & Furious, 2009; The Bourne Ultimatum, 2007), Animation (Open Season 2; The Ant Bully, 2006), Comedy (Office Space, 1999), SciFi (X-Men Origins: Wolverine, 2009), War (Defiance, 2008). After a split, Server A serves [Min - Comedy), holding the Action and Animation partitions, and Server B serves [Comedy - Max], holding the Comedy, SciFi, and War partitions.

53 Key Selection: Things to Consider
Scalability: distribute load as much as possible; hot partitions can be load balanced; the PartitionKey is critical for scalability. Query efficiency and speed: avoid frequent large scans; parallelize queries; point queries are most efficient. Entity group transactions: transactions across a single partition give you transaction semantics and reduce round trips.

54 Expect Continuation Tokens – Seriously!
Expect a continuation token whenever: the query hits the maximum of 5 seconds of execution time; the response reaches the maximum of 1,000 rows; or the query reaches the end of a partition range boundary.
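Handling continuation tokens is just a loop until the token comes back empty. Here `query_page` is an invented stub standing in for a table query that returns a partial result plus a token:

```python
# Sketch of a continuation-token loop; query_page is a stand-in for a
# table query that may return only part of the result set plus a token.
def query_page(token):
    data = {None: (["a", "b"], "t1"), "t1": (["c"], "t2"), "t2": (["d"], None)}
    return data[token]

def query_all():
    results, token = [], None
    while True:
        rows, token = query_page(token)
        results.extend(rows)
        if token is None:      # no continuation token: query is complete
            break
    return results

assert query_all() == ["a", "b", "c", "d"]
```

Forgetting this loop means silently dropping rows whenever a range query crosses a partition boundary or a size limit.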

55 Tables Recap Select PartitionKey and RowKey that help scale
Select a PartitionKey and RowKey that help you scale: efficient for frequently used queries, supporting batch transactions, and distributing load. Avoid append-only patterns; distribute writes by using a hash or similar prefix. Always handle continuation tokens, and expect them for range queries. "OR" predicates are not optimized: execute the queries that form the "OR" as separate queries. Implement a back-off strategy for "server busy" retries: either the system is load balancing partitions to meet traffic needs, or the load on a single partition has exceeded its limits. With WCF Data Services: use a new context for each logical operation; AddObject/AttachTo can throw an exception if the entity is already being tracked; a point query throws an exception if the resource does not exist (use IgnoreResourceNotFoundException).

56 Queues Their Unique Role in Building Reliable, Scalable Applications
You want roles that work closely together but are not bound together: tight coupling leads to brittleness, and decoupling aids scaling and performance. A queue can hold an unlimited number of messages, but each message must be serializable as XML and is limited to 8 KB in size. Queues are commonly used with the work ticket pattern. Why not simply use a table?

57 Queue Terminology

58 Message Lifecycle GetMessage (Timeout) RemoveMessage PutMessage
A GetMessage returns a response like the following (timestamps truncated as captured in the transcript):

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/xml
Date: Tue, 09 Dec :04:30 GMT
Server: Nephos Queue Service Version 1.0 Microsoft-HTTPAPI/2.0

<?xml version="1.0" encoding="utf-8"?>
<QueueMessagesList>
  <QueueMessage>
    <MessageId>5974b586-0df3-4e2d-ad0c-18e3892bfca2</MessageId>
    <InsertionTime>Mon, 22 Sep :29:20 GMT</InsertionTime>
    <ExpirationTime>Mon, 29 Sep :29:20 GMT</ExpirationTime>
    <PopReceipt>YzQ4Yzg1MDIGM0MDFiZDAwYzEw</PopReceipt>
    <TimeNextVisible>Tue, 23 Sep :29:20 GMT</TimeNextVisible>
    <MessageText>PHRlc3Q+dG...dGVzdD4=</MessageText>
  </QueueMessage>
</QueueMessagesList>

(Diagram: a web role POSTs messages via PutMessage; worker roles retrieve them with GetMessage (Timeout) and later DELETE them via RemoveMessage.)

59 Truncated Exponential Back Off Polling
Consider a backoff polling approach: each empty poll doubles the interval (2x), and a successful poll sets the interval back to 1.
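The truncated exponential backoff rule above fits in a few lines: double the interval on each empty poll up to a cap, reset to the minimum on success. The interval bounds chosen here are illustrative:

```python
# Sketch of truncated exponential backoff polling: empty polls double the
# interval up to a cap; a successful poll resets it to the minimum.
MIN_INTERVAL, MAX_INTERVAL = 1, 32  # seconds; illustrative values

def next_interval(current, got_message):
    if got_message:
        return MIN_INTERVAL
    return min(current * 2, MAX_INTERVAL)  # "truncated" = capped

interval = MIN_INTERVAL
history = []
for got in [False, False, False, True, False]:
    interval = next_interval(interval, got)
    history.append(interval)
assert history == [2, 4, 8, 1, 2]
```

The cap keeps a long-idle worker from waiting arbitrarily long once traffic resumes.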

60 Removing Poison Messages
Producers (P1, P2) enqueue messages; consumers (C1, C2) dequeue them with a 30-second visibility timeout:
1. GetMessage(Q, 30 s) → msg 1 (consumer C1)
2. GetMessage(Q, 30 s) → msg 2 (consumer C2)

61 Removing Poison Messages
Continuing the sequence, a crashed consumer's message reappears:
1. GetMessage(Q, 30 s) → msg 1 (consumer C1)
2. GetMessage(Q, 30 s) → msg 2 (consumer C2)
3. C2 consumed msg 2
4. DeleteMessage(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after its dequeue
7. GetMessage(Q, 30 s) → msg 1 (now delivered to C2)

62 Removing Poison Messages
The full sequence, with the dequeue count finally catching the poison message:
1. Dequeue(Q, 30 sec) → msg 1 (consumer C1)
2. Dequeue(Q, 30 sec) → msg 2 (consumer C2)
3. C2 consumed msg 2
4. Delete(Q, msg 2)
5. C1 crashed
6. msg 1 becomes visible again 30 s after dequeue
7. Dequeue(Q, 30 sec) → msg 1 (consumer C2)
8. C2 crashed
9. msg 1 becomes visible again 30 s after dequeue
10. C1 restarted
11. Dequeue(Q, 30 sec) → msg 1
12. DequeueCount > 2
13. Delete(Q, msg 1)
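The poison-message pattern can be sketched with a local queue that tracks per-message dequeue counts. The class and threshold are illustrative; the requeue call simulates a message becoming visible again after a consumer crash:

```python
# Sketch of poison-message removal: count dequeues per message and park
# the message once the count crosses a threshold. All names are invented.
import collections

MAX_DEQUEUES = 2

class SimpleQueue:
    def __init__(self, messages):
        self.messages = collections.deque(messages)
        self.dequeue_count = collections.Counter()

    def get(self):
        msg = self.messages.popleft()
        self.dequeue_count[msg] += 1
        return msg, self.dequeue_count[msg]

    def requeue(self, msg):  # message became visible again (consumer crashed)
        self.messages.append(msg)

q = SimpleQueue(["poison"])
dead_letter = []
while q.messages:
    msg, count = q.get()
    if count > MAX_DEQUEUES:
        dead_letter.append(msg)  # delete / park instead of reprocessing
    else:
        q.requeue(msg)           # simulate a crash before DeleteMessage

assert dead_letter == ["poison"]
```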

63 Queues Recap Make message processing idempotent
Make message processing idempotent, so there is no need to deal with failures. Do not rely on order: invisible messages result in out-of-order delivery. Use the dequeue count to remove poison messages by enforcing a threshold on each message's dequeue count. For messages over 8 KB, use a blob to store the message data with a reference in the message; this also lets you batch messages, but remember to garbage collect orphaned blobs. Use the message count to scale, dynamically increasing or reducing workers.

64 Windows Azure Storage Takeaways
Data abstractions to build your applications Blobs – Files and large objects Drives – NTFS APIs for migrating applications Tables – Massively scalable structured storage Queues – Reliable delivery of messages Easy to use via the Storage Client Library More info on Windows Azure Storage at:

65 Best Practices

66 Picking the Right VM Size
Having the correct VM size can make a big difference in costs Fundamental choice – larger, fewer VMs vs. many smaller instances If you scale better than linear across cores, larger VMs could save you money Pretty rare to see linear scaling across 8 cores. More instances may provide better uptime and reliability (more failures needed to take your service down) Only real right answer – experiment with multiple sizes and instance counts in order to measure and find what is ideal for you

67 Using Your VM to the Maximum
Remember: 1 role instance == 1 VM running Windows. 1 role instance != one specific task for your code You’re paying for the entire VM so why not use it? Common mistake – split up code into multiple roles, each not using up CPU. Balance between using up CPU vs. having free capacity in times of need. Multiple ways to use your CPU to the fullest

68 Exploiting Concurrency
Spin up additional processes, each with a specific task or as a unit of concurrency. May not be ideal if number of active processes exceeds number of cores Use multithreading aggressively In networking code, correct usage of NT IO Completion Ports will let the kernel schedule the precise number of threads In .NET 4, use the Task Parallel Library Data parallelism Task parallelism

69 Finding Good Code Neighbors
Typically code falls into one or more of these categories: memory intensive, CPU intensive, network IO intensive, storage IO intensive. Find code that is intensive in different resources to live together. Example: distributed network caches are typically network- and memory-intensive, so they may be a good neighbor for storage IO-intensive code.

70 Scaling Appropriately
Monitor your application and make sure you're scaled appropriately (not over-scaled). Spinning VMs up and down automatically is good at large scale, but remember that VMs take a few minutes to come up and cost ~$3 a day (give or take) to keep running. Being too aggressive in spinning down VMs can result in a poor user experience. It is a trade-off between the risk of failure or poor user experience from not having excess capacity, and the cost of idling VMs: performance vs. cost.

71 Storage Costs Understand your application's storage profile and how storage billing works, and make service choices based on that profile. E.g., SQL Azure has a flat fee while Windows Azure Tables charges per transaction, so the service choice can make a big cost difference depending on your app profile. Caching and compression help a lot with storage costs.

72 Saving Bandwidth Costs
Bandwidth costs are a huge part of any popular web app’s billing profile Saving bandwidth costs often lead to savings in other places Sending fewer things over the wire often means getting fewer things from storage All of these tips have the side benefit of improving your web app’s performance and user experience Sending fewer things means your VM has time to do other tasks

73 Compressing Content Gzip all output content
All modern browsers can decompress on the fly. Compared to Compress, Gzip has much better compression and freedom from patented algorithms. You trade compute costs for storage size. Minify your JavaScript and CSS. Minimize image sizes: use Portable Network Graphics (PNGs), crush your PNGs, strip needless metadata, and make all PNGs palette PNGs.
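The compute-for-bandwidth trade can be demonstrated with Python's stdlib gzip module; repetitive markup compresses very well, and the round trip is lossless:

```python
# Sketch of gzipping response content before sending it over the wire.
import gzip

body = b"<html>" + b"repetitive content " * 200 + b"</html>"
compressed = gzip.compress(body)

assert len(compressed) < len(body)          # big win on repetitive text
assert gzip.decompress(compressed) == body  # lossless round trip
```

On a real web role the server sets Content-Encoding: gzip and the browser decompresses transparently.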

74 Best Practices Summary
Doing 'less' is the key to saving costs. Measure everything: know your application profile in and out. Cutting costs is often an art: experiment with different approaches.

75 Cloud Computing for eScience Applications

76 NCBI BLAST BLAST (Basic Local Alignment Search Tool)
The most important software in bioinformatics, BLAST identifies similarity between bio-sequences. It is computationally intensive, with a large number of pairwise alignment operations: a single BLAST run can take 700-1000 CPU hours. Meanwhile, sequence databases are growing exponentially; GenBank doubled in size in about 15 months.

77 Opportunities for Cloud Computing
It is easy to parallelize BLAST. Segment the input: segment processing (querying) is pleasingly parallel. Or segment the database (e.g., mpiBLAST), which needs special result-reduction processing. The data volumes are large: a normal BLAST database can be as large as 10 GB, so 100 nodes means the peak storage bandwidth demand could reach 1 TB, and the output of BLAST is usually much larger than the input. However, the large-scale resources required to perform this parallelization are usually unavailable to the majority of researchers. Thus the emergence of cloud computing provides the potential opportunity to expand the availability of large-scale alignment search to a much larger set of researchers.
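Query segmentation, the approach AzureBLAST takes, amounts to splitting the input sequences into fixed-size partitions that workers process independently, then concatenating the per-partition results. A minimal sketch (the 100-sequences-per-partition figure comes from the micro-benchmarks later in the talk):

```python
# Sketch of query segmentation: partition the input sequences, process
# partitions in parallel, merge results. Names are illustrative.
def partition(sequences, per_partition=100):
    return [sequences[i:i + per_partition]
            for i in range(0, len(sequences), per_partition)]

seqs = [f"seq{i}" for i in range(250)]
parts = partition(seqs)
assert [len(p) for p in parts] == [100, 100, 50]

# Merge step: concatenating per-partition results recovers the full set,
# which is why the pattern is "pleasingly parallel".
assert [s for p in parts for s in p] == seqs
```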

78 AzureBLAST Parallel BLAST engine on Azure
Query-segmentation data-parallel pattern split the input sequences query partitions in parallel merge results together when done Follows the general suggested application model Web Role + Queue + Worker With three special considerations Batch job management Task parallelism on an elastic Cloud Wei Lu, Jared Jackson, and Roger Barga, AzureBlast: A Case Study of Developing Science Applications on the Cloud, in Proceedings of the 1st Workshop on Scientific Cloud Computing (Science Cloud 2010), Association for Computing Machinery, Inc., 21 June 2010

79 AzureBLAST Task-Flow A simple Split/Join pattern
Leverage the multiple cores of each instance via the "-a" argument of NCBI-BLAST: 1, 2, 4, 8 threads for small, medium, large, and extra-large instance sizes. Task granularity: too large a partition → load imbalance; too small a partition → unnecessary overheads (NCBI-BLAST startup overhead, data-transfer overhead). Best practice: use test runs to profile, and set the size to mitigate the overhead. The value of visibilityTimeout for each BLAST task is essentially an estimate of the task run time: too small → repeated computation; too large → an unnecessarily long wait in case of instance failure. (Task flow: splitting task → BLAST tasks → merging task.)

80 Micro-Benchmarks Inform Design
Task size vs. performance: there is a benefit from the warm-cache effect, and 100 sequences per partition is the best choice. The smallest task, containing only one sequence, is an order of magnitude slower than a large task containing 100 sequences; beyond 100 sequences per partition the instance is saturated and generates constant throughput. Instance size vs. performance: super-linear speedup with larger worker instances, primarily due to memory capacity. Task size/instance size vs. cost: the extra-large instance generated the best and most economical throughput by fully utilizing the resource.

81 AzureBLAST Architecture
(Diagram: a web portal and web service handle job registration; a job-management role runs the job scheduler and scaling engine, with the job registry kept in an Azure Table; a global dispatch queue feeds worker roles running the splitting, BLAST, and merging tasks; Azure Blob storage holds the BLAST databases, temporary data, etc.; a database-updating role pulls from the NCBI databases.)

82 AzureBLAST Job Portal ASP.NET program hosted by a web role instance
Users submit jobs and track each job's status and logs, with authentication/authorization based on Live ID. An accepted job is stored into the job registry table for fault tolerance, avoiding in-memory state. (Diagram: the web portal and web service feed job registration, the job scheduler, the scaling engine, and the job registry.)

83 Demonstration

84 R. palustris as a platform for H2 production
Eric Shadt, SAGE; Sam Phattarasukol, Harwood Lab, UW. Blasted ~5,000 proteins (700K sequences): against all NCBI non-redundant proteins, completed in 30 min; against ~5,000 proteins from another strain, completed in less than 30 sec. AzureBLAST significantly saved computing time.

85 All-Against-All Experiment
Discovering homologs: discover the interrelationships of known protein sequences with an "all against all" query, where the database is also the input query. The protein database is large (4.2 GB), with 9,865,668 sequences to be queried in total: theoretically, 100 billion sequence comparisons! Performance estimation, based on sample runs on one extra-large Azure instance: the job would require 3,216,731 minutes (6.1 years) on one desktop. Experiments at this scale are usually infeasible for most scientists. Researchers at Seattle Children's Hospital interested in protein interactions wanted to know more about the interrelationships of known protein sequences. Due to the sheer number of known proteins, nearly 10 million, this was a very difficult question for even the most state-of-the-art computer to solve. When the researchers first approached the XCG team to see if AzureBLAST could help them with this problem, initial estimates showed that it would take a single computer over six years to find the results. Using AzureBLAST, the 10 million protein sequences were split into groups that were distributed across the Azure cloud; in fact, there were so many sequences that it was necessary to distribute them to data centers in multiple countries, spanning different continents. In the end the results were found in about one week using the cloud, and this has been the largest research project to date run on Azure.

86 Our Approach Allocated a total of ~4000 instances
475 extra-large VMs (8 cores per VM) across four datacenters: US (2), West Europe, and North Europe 8 deployments of AzureBLAST, each with its own co-located storage service Divide the 10 million sequences into multiple segments; each segment is submitted to one deployment as one job for execution Each segment consists of smaller partitions When load imbalances occur, redistribute the load manually Since each deployment can have at most 500 weighted instances
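The two-level split described above (segments per deployment, partitions per segment) can be sketched like this. The chunk counts are illustrative, not the actual job sizes:

```python
# Illustrative sketch of the two-level split: input sequences are divided
# into one segment per deployment (a job), and each segment into smaller
# partitions (tasks). The sizes below are made up for the example.
def split(items, n_chunks):
    """Split items into n_chunks nearly equal contiguous chunks."""
    k, r = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = k + (1 if i < r else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

sequences = list(range(10_000))                    # stand-in for 10M sequence IDs
segments = split(sequences, 8)                     # one segment per deployment
partitions = [split(seg, 50) for seg in segments]  # tasks within each job
```

Keeping partitions small is what makes manual rebalancing feasible: a lagging deployment's remaining partitions can simply be resubmitted elsewhere.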

87 End Result The number of total hits is 1,764,579,487
Total size of the output is ~230 GB; the number of total hits is 1,764,579,487 Started on March 25th; the last task completed on April 8th (10 days of compute) But based on our estimates, actual working instance time should be 6 to 8 days Look into the log data to analyze what took place…

88 Understanding Azure by analyzing logs
A normal log record pairs a start with a completion: 3/31/2010 6:14 RD00155D3611B0 Executing the task 3/31/2010 6:25 Execution of task is done, it took 10.9 mins 3/31/2010 6:44 Execution of task is done, it took 19.3 mins 3/31/2010 7:02 Execution of task is done, it took … mins Otherwise, something is wrong (e.g., a task failed to complete): 3/31/2010 8:22 RD00155D3611B0 Executing the task 3/31/2010 9:50 Executing the task 3/31/2010 …:12 Execution of task is done, it took 82 mins
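The failure pattern in the second excerpt (a start record with no matching completion before the next start) can be detected mechanically. A minimal sketch of that analysis, assuming the log for one node has already been reduced to (kind, timestamp) events:

```python
# Hedged sketch of the log analysis: for one node's event stream, a "start"
# that is not followed by a "done" before the next "start" marks a lost task.
def find_lost_tasks(records):
    """records: list of ('start', ts) or ('done', ts) events for one node."""
    lost, pending = [], None
    for kind, ts in records:
        if kind == "start":
            if pending is not None:   # previous task never completed
                lost.append(pending)
            pending = ts
        elif kind == "done":
            pending = None
    if pending is not None:
        lost.append(pending)
    return lost

# Events mirroring the slide's excerpt: the 8:22 task never reports "done".
events = [("start", "6:14"), ("done", "6:25"),
          ("start", "8:22"), ("start", "9:50"), ("done", "11:12")]
lost = find_lost_tasks(events)
```

Running exactly this kind of pass over the logs is what surfaced the update-domain and fault-domain patterns on the next two slides.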

89 Surviving System Upgrades
North Europe datacenter: 34,256 tasks processed in total All 62 compute nodes lost tasks and then came back in groups; this is an update domain at work Groups were ~30 mins apart, with ~6 nodes per group SCH_10Million_NE3, North Europe (62 extra-large instances) The job ran well and fast; on the afternoon of 3/31, every node lost one task
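The grouping visible in the logs (nodes dropping out together, one batch roughly every 30 minutes) can be recovered by bucketing the task-loss events into time windows. A sketch with made-up node names and times:

```python
from collections import defaultdict

# Sketch of the analysis behind the slide: bucket task-loss events into
# fixed time windows; nodes that drop out in the same window likely share
# an update domain. Node names and offsets below are invented.
def group_by_window(events, window_minutes=30):
    """events: list of (node, minute_offset). Returns window index -> nodes."""
    groups = defaultdict(set)
    for node, t in events:
        groups[t // window_minutes].add(node)
    return dict(groups)

events = [("n1", 0), ("n2", 3), ("n3", 5),   # one suspected update domain
          ("n4", 40), ("n5", 42)]            # the next group, ~30 min later
groups = group_by_window(events)
```

With real timestamps this produces the ~6-node groups the slide describes, which is the signature of a rolling platform upgrade across update domains.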

90 Surviving Storage Failures
West Europe datacenter: 30,976 tasks completed before the job was killed 35 nodes experienced blob-write failures at the same time A reasonable guess: the fault domain is at work

91 MODISAzure: Computing Evapotranspiration (ET) in the Cloud
"You never miss the water till the well has run dry." (Irish proverb)

92 Computing Evapotranspiration (ET)
Evapotranspiration (ET) is the release of water to the atmosphere by evaporation from open water bodies and by transpiration, or evaporation through plant membranes, by plants. Penman-Monteith (1964) ET = water volume evapotranspired (m3 s-1 m-2) Δ = rate of change of saturation specific humidity with air temperature (Pa K-1) λv = latent heat of vaporization (J/g) Rn = net radiation (W m-2) cp = specific heat capacity of air (J kg-1 K-1) ρa = dry air density (kg m-3) δq = vapor pressure deficit (Pa) ga = conductivity of air (inverse of ra) (m s-1) gs = conductivity of plant stomata (inverse of rs) (m s-1) γ = psychrometric constant (γ ≈ 66 Pa K-1) Estimating resistance/conductivity across a catchment can be tricky Big reduction calculation; lots of inputs: big data reduction Some of the inputs are not so simple © 2010 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION
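With the symbols defined above, the Penman-Monteith relation the slide refers to can be written as follows (a standard statement of the equation, reconstructed here rather than transcribed from the deck):

```latex
ET \;=\; \frac{\Delta\, R_n \;+\; \rho_a\, c_p\, \delta q\, g_a}
              {\bigl(\Delta + \gamma\,(1 + g_a/g_s)\bigr)\,\lambda_v}
```

The numerator is the familiar latent-heat flux; dividing by the latent heat of vaporization λv converts that energy flux into the water volume flux ET, which is why λv appears in the denominator.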

93 ET Synthesizes Imagery, Sensors, Models and Field Data
FLUXNET curated sensor dataset (30GB, 960 files) Climate classification ~1MB (1file) Vegetative clumping ~5MB (1file) FLUXNET curated field dataset 2 KB (1 file) NASA MODIS imagery source archives 5 TB (600K files) NCEP/NCAR ~100MB (4K files) 20 US year = 1 global year

94 MODISAzure: Four Stage Image Processing Pipeline
Data collection (map) stage: downloads requested input tiles from NASA FTP sites; includes geospatial lookup for non-sinusoidal tiles that will contribute to a reprojected sinusoidal tile Reprojection (map) stage: converts source tile(s) to intermediate-result sinusoidal tiles; simple nearest-neighbor or spline algorithms Derivation reduction stage: first stage visible to the scientist; computes ET in our initial use Analysis reduction stage: optional second stage visible to the scientist; enables production of science analysis artifacts such as maps, tables, and virtual sensors [Pipeline diagram: scientists submit requests through the AzureMODIS Service web role portal; work flows through the Request, Download, Reprojection, Reduction #1, and Reduction #2 queues; source imagery is pulled from download sites, and scientific results are available for download.]
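The queue-driven structure above can be sketched in miniature: each stage drains its own queue and feeds the next stage's queue. Queue names and the toy "processing" step are illustrative, and Python's in-process `queue` stands in for Azure queue storage:

```python
import queue

# Minimal sketch of one queue-driven pipeline stage: a worker drains its
# stage's input queue, processes each task, and enqueues the result for the
# next stage, mirroring Download -> Reprojection -> Reduction #1 -> #2.
def run_stage(in_q, out_q, process):
    done = 0
    while True:
        try:
            task = in_q.get_nowait()
        except queue.Empty:
            return done            # queue drained; stage is finished
        out_q.put(process(task))
        done += 1

download_q, reprojection_q = queue.Queue(), queue.Queue()
for tile in ["h08v05", "h09v05"]:   # hypothetical MODIS tile IDs
    download_q.put(tile)
n_done = run_stage(download_q, reprojection_q, lambda t: t + ".sinusoidal")
```

Because stages communicate only through queues, each stage can be scaled (or restarted) independently, which is the property the architecture slides that follow rely on.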

95 MODISAzure: Architectural Big Picture (1/2)
MODISAzure Service is the web role front door: receives all user requests; queues each request to the appropriate Download, Reprojection, or Reduction job queue Service Monitor is a dedicated worker role: parses all job requests into tasks (recoverable units of work); execution status of all jobs and tasks is persisted in tables [Diagram: a <PipelineStage> request enters the MODISAzure Service (web role) and lands on the <PipelineStage>Job Queue; the Service Monitor (worker role) parses and persists it, dispatching to the <PipelineStage>Task Queue; status is recorded in the <PipelineStage>JobStatus and <PipelineStage>TaskStatus tables.]

96 MODISAzure: Architectural Big Picture (2/2)
All work is actually done by a GenericWorker (worker role): dequeues tasks created by the Service Monitor; retries failed tasks 3 times; maintains all task status [Diagram: the Service Monitor (worker role) parses, persists, and dispatches to the <PipelineStage>Task Queue; GenericWorker instances dequeue tasks, read and write <Input>Data Storage, and update <PipelineStage>TaskStatus.]
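The "retries failed tasks 3 times" behavior can be sketched as follows. This is an illustration of the policy, not the actual worker code; the status dictionary stands in for the TaskStatus table:

```python
MAX_RETRIES = 3

# Sketch of the GenericWorker retry policy described above: re-attempt a
# failing task up to 3 times, recording status as the worker does in tables.
def execute_with_retries(task, attempt_fn, status):
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            result = attempt_fn(task)
            status[task] = "Succeeded"
            return result
        except Exception:
            status[task] = f"Failed attempt {attempt}"
    status[task] = "Failed"       # give up after MAX_RETRIES attempts
    return None

status = {}
calls = {"n": 0}
def flaky(task):                  # fails twice, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient blob error")
    return f"{task}: ok"

result = execute_with_retries("tile-42", flaky, status)
```

A bounded retry count matters in this setting: it absorbs transient storage or network failures (like the blob-write failures on slide 90) without letting a genuinely broken task spin forever.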

97 Example Pipeline Stage: Reprojection Service
A reprojection request enters the job queue; each entity in the ReprojectionJobStatus table specifies a single reprojection job request The Service Monitor (worker role) parses and persists the request; each entity in the ReprojectionTaskStatus table specifies a single reprojection task (i.e., a single tile), dispatched to the task queue ScanTimeList: query this table to get the list of satellite scan times that cover a target tile SwathGranuleMeta: query this table to get geo-metadata (e.g., boundaries) for each swath tile GenericWorker instances read swath source data storage and write reprojection data storage
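The two metadata lookups a reprojection task performs can be sketched with in-memory stand-ins. The table contents, tile IDs, and boundary values below are all invented for the example:

```python
# Illustrative stand-ins for the two metadata tables: ScanTimeList maps a
# target tile to the satellite scan times that cover it; SwathGranuleMeta
# maps each scan time to swath boundaries. All values here are made up.
scan_time_list = {
    "h08v05": ["2010-03-01T10:30", "2010-03-01T12:10"],
}
swath_granule_meta = {
    "2010-03-01T10:30": {"lat": (30, 40), "lon": (-120, -110)},
    "2010-03-01T12:10": {"lat": (30, 40), "lon": (-118, -108)},
}

def swaths_for_tile(tile):
    """Query pattern used by a reprojection task: tile -> scan times -> geo-metadata."""
    return [(t, swath_granule_meta[t]) for t in scan_time_list.get(tile, [])]

swaths = swaths_for_tile("h08v05")
```

Precomputing these tables is what lets each worker find its input swaths with two cheap lookups instead of scanning the 5 TB source archive.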

98 Costs for 1 US Year ET Computation
Computational costs are driven by data scale and the need to run reductions multiple times; storage costs are driven by data scale and the 6-month project duration Both are small with respect to the people costs, even at graduate-student rates! Data collection stage: … GB, 60K files, 10 MB/sec, 11 hours, <10 workers; $50 upload, $450 storage Reprojection stage: 400 GB, 45K files, 3500 hours, 20-100 workers; $420 cpu, $60 download Derivation reduction stage: 5-7 GB, 5.5K files, 1800 hours, 20-100 workers; $216 cpu, $1 download, $6 storage Analysis reduction stage: <10 GB, ~1K files, 1800 hours, 20-100 workers; $216 cpu, $2 download, $9 storage Total: $1420
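The compute line items above are consistent with the 2010 Azure pay-as-you-go rate of $0.12 per instance-hour, which is an assumption on my part; the slide gives only the totals. A quick reconstruction (integer cents avoid float rounding):

```python
# Rough reconstruction of the compute costs, assuming the 2010 Azure
# pay-as-you-go rate of $0.12 per instance-hour (an assumption; the slide
# states only the dollar totals).
RATE_CENTS = 12  # assumed: 12 cents per instance-hour

def cpu_cost_dollars(instance_hours):
    return instance_hours * RATE_CENTS / 100

reprojection = cpu_cost_dollars(3500)   # matches the slide's $420
reduction_1 = cpu_cost_dollars(1800)    # matches the slide's $216
reduction_2 = cpu_cost_dollars(1800)    # matches the slide's $216
```

That the stated per-stage figures fall out of instance-hours times a flat rate is exactly why the slide emphasizes that compute and storage costs are small next to people costs.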

99 Observations and Experience
Clouds are the largest-scale computer centers ever constructed and have the potential to be important to both large- and small-scale science problems. Equally important, they can increase participation in research by providing needed resources to users and communities without ready access. Clouds are suitable for "loosely coupled" data-parallel applications and can support many interesting "programming patterns", but tightly coupled, low-latency applications do not perform optimally on clouds today. Clouds provide valuable fault tolerance and scalability abstractions. Clouds act as an amplifier for familiar client tools and on-premises compute. Cloud services to support research provide considerable leverage for both individual researchers and entire communities of researchers.

100 Resources: Cloud Research Community Site
Getting-started steps for developers Available research services Use cases on Azure for research Event announcements Detailed tutorials Technical papers Contact us with questions

101 Resources: AzureScope
Simple benchmarks illustrating basic performance of the compute and storage services Benchmarks for reference algorithms Best-practice tips Code samples Contact us with questions


103 Demonstration

104 Azure in Action, Manning Press
Programming Windows Azure, O’Reilly Press Bing: Channel 9 Windows Azure Bing: Windows Azure Platform Training Kit - November Update

