Clouds will win! Geoffrey Fox Director,

Slides:



Advertisements
Similar presentations
Large Scale Computing Systems
Advertisements

Architecture and Measured Characteristics of a Cloud Based Internet of Things May 22, 2012 The 2012 International Conference.
SALSA HPC Group School of Informatics and Computing Indiana University.
International Conference on Cloud and Green Computing (CGC2011, SCA2011, DASC2011, PICom2011, EmbeddedCom2011) University.
Clouds from FutureGrid’s Perspective April Geoffrey Fox Director, Digital Science Center, Pervasive.
Cyberinfrastructure and Its Application Cyberinfrastructure Day Salish Kootenai College, Pablo MT August Geoffrey Fox
Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.
1 Clouds and Sensor Grids CTS2009 Conference May Alex Ho Anabas Inc. Geoffrey Fox Computer Science, Informatics, Physics Chair Informatics Department.
Nikolay Tomitov Technical Trainer SoftAcad.bg.  What are Amazon Web services (AWS) ?  What’s cool when developing with AWS ?  Architecture of AWS 
Student Visits August Geoffrey Fox
Cloud computing Tahani aljehani.
Data-intensive Computing on the Cloud: Concepts, Technologies and Applications B. Ramamurthy This talks is partially supported by National.
Banking Clouds V International Youth Banking Forum.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
Cloud Computing. Cloud Computing Overview Course Content
1 Challenges Facing Modeling and Simulation in HPC Environments Panel remarks ECMS Multiconference HPCS 2008 Nicosia Cyprus June Geoffrey Fox Community.
A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
X-Informatics Cloud Technology (Continued) March Geoffrey Fox Associate.
Big Data and Clouds: Challenges and Opportunities NIST January Geoffrey Fox
Applying Twister to Scientific Applications CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.
SCSI: Platforms & Foundations: Cyberinfrastructure Socially Coupled Systems & Informatics: Science, Computing & Decision Making in a Complex Interdependent.
X-Informatics Introduction: What is Big Data, Data Analytics and X-Informatics? January Geoffrey Fox
Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.
X-Informatics Cloud Technology (Continued) March Geoffrey Fox Associate.
SALSASALSASALSASALSA AOGS, Singapore, August 11-14, 2009 Geoffrey Fox 1,2 and Marlon Pierce 1
SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase.
Overview of Cyberinfrastructure Northeastern Illinois University Cyberinfrastructure Day August Geoffrey Fox
Science of Cloud Computing Panel Cloud2011 Washington DC July Geoffrey Fox
Software Architecture
Cloud Architecture for Earthquake Science 7 th ACES International Workshop 6th October 2010 Grand Park Otaru Otaru Japan Geoffrey Fox
Science Clouds and FutureGrid’s Perspective June Science Clouds Workshop HPDC 2012 Delft Geoffrey Fox
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
Introduction to Cloud Computing
OpenQuake Infomall ACES Meeting Maui May Geoffrey Fox
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read
Biomedical Cloud Computing iDASH Symposium San Diego CA May Geoffrey Fox
Scientific Computing Environments ( Distributed Computing in an Exascale era) August Geoffrey Fox
ICETE 2012 Joint Conference on e-Business and Telecommunications Hotel Meliá Roma Aurelia Antica, Rome, Italy July
SALSA HPC Group School of Informatics and Computing Indiana University.
Implications of Clouds for Data Intensive Science with application to Biomedical Science I400 Indiana University March Geoffrey Fox
SALSASALSASALSASALSA FutureGrid Venus-C June Geoffrey Fox
SALSASALSASALSASALSA Clouds Ball Aerospace March Geoffrey Fox
X-Informatics MapReduce February Geoffrey Fox Associate Dean for Research.
SALSASALSASALSASALSA Cloud Panel Session CloudCom 2009 Beijing Jiaotong University Beijing December Geoffrey Fox
Presented by: Katie Woods and Jordan Howell. * Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of.
Virtual Appliances CTS Conference 2011 Philadelphia May Geoffrey Fox
Clouds will win! CTS Conference 2011 Philadelphia May Geoffrey Fox
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Security: systems, clouds, models, and privacy challenges iDASH Symposium San Diego CA October Geoffrey.
Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications Thilina Gunarathne, Tak-Lon Wu Judy Qiu, Geoffrey Fox School of Informatics,
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
HPC in the Cloud – Clearing the Mist or Lost in the Fog Panel at SC11 Seattle November Geoffrey Fox
Microsoft Cloud Solution.  What is the cloud?  Windows Azure  What services does it offer?  How does it all work?  How to go about using it  Further.
1 Cloud Systems Panel at HPDC Boston June Geoffrey Fox Community Grids Laboratory, School of informatics Indiana University
Cloud Distributed Computing Environment Hadoop. Hadoop is an open-source software system that provides a distributed computing environment on cloud (data.
Directions in eScience Interoperability and Science Clouds June Interoperability in Action – Standards Implementation.
Next Generation of Apache Hadoop MapReduce Owen
BIG DATA/ Hadoop Interview Questions.
Big Data Workshop Summary Virtual School for Computational Science and Engineering July Geoffrey Fox
Recap: introduction to e-science
Biology MDS and Clustering Results
Clouds from FutureGrid’s Perspective
Big Data Architectures
Services, Security, and Privacy in Cloud Computing
Panel on Research Challenges in Big Data
MapReduce: Simplified Data Processing on Large Clusters
Cloud versus Cloud: How Will Cloud Computing Shape Our World?
Presentation transcript:

Clouds will win! Geoffrey Fox Director, Digital Science Center, Pervasive Technology Institute Associate Dean for Research and Graduate Studies, School of Informatics and Computing Indiana University Bloomington FutureGrid Tutorial at PPAM 2011 Torun Poland September

Important Trends Data Deluge in all fields of science Multicore implies parallel computing important again – Performance from extra cores – not extra clock speed – GPU enhanced systems can give big power boost Clouds – new commercially supported data center model replacing compute grids (and your general purpose computer center) Light weight clients: Sensors, Smartphones and tablets accessing and supported by backend services in cloud Commercial efforts moving much faster than academia in both innovation and deployment

Jaliya Ekanayake - School of Informatics and Computing 3 Big Data in Many Domains  According to one estimate, we created 150 exabytes (billion gigabytes) of data in In 2010, we created 1,200 exabytesone  Enterprise Storage sold in 2010 was 15 Exabytes; BUT total storage sold (including flash memory etc.) was 1500 Exabytes  Size of the web ~ 50 billion web pages: MapReduce at Google was on average processing 20PB per day in January 2008 Size of the web  During 2009, American drone aircraft flying over Iraq and Afghanistan sent back around 24 years’ worth of video footage   New models being deployed in 2010 will produce ten times as many data streams as their predecessors, and those in 2011 will produce 30 times as many  ~108 million sequence records in GenBank in 2009, doubling in every 18 monthsGenBank  ~20 million purchases at Wal-Mart a day  90 million Tweets a dayTweets a day  Astronomy, Particle Physics, Medical Records …  Most scientific task shows CPU:IO ratio of 10000:1 – Dr. Jim Gray  The Fourth Paradigm: Data-Intensive Scientific Discovery The Fourth Paradigm: Data-Intensive Scientific Discovery  Large Hadron Collider at CERN; 100 Petabytes to find Higgs Boson

Data Centers Clouds & Economies of Scale I Range in size from “edge” facilities to megascale. Economies of scale Approximate costs for a small size center (1K servers) and a larger, 50K server center. Each data center is 11.5 times the size of a football field TechnologyCost in small- sized Data Center Cost in Large Data Center Ratio Network$95 per Mbps/ month $13 per Mbps/ month 7.1 Storage$2.20 per GB/ month $0.40 per GB/ month 5.7 Administration~140 servers/ Administrator >1000 Servers/ Administrator Google warehouses of computers on the banks of the Columbia River, in The Dalles, Oregon Such centers use 20MW-200MW (Future) each with 150 watts per CPU Save money from large size, positioning with cheap power and access with Internet

6 Builds giant data centers with 100,000’s of computers; ~ to a shipping container with Internet access “Microsoft will cram between 150 and 220 shipping containers filled with data center gear into a new 500,000 square foot Chicago facility. This move marks the most significant, public use of the shipping container systems popularized by the likes of Sun Microsystems and Rackable Systems to date.” Data Centers, Clouds & Economies of Scale II

X as a Service SaaS: Software as a Service imply software capabilities (programs) have a service (messaging) interface – Applying systematically reduces system complexity to being linear in number of components – Access via messaging rather than by installing in /usr/bin IaaS: Infrastructure as a Service or HaaS: Hardware as a Service – get your computer time with a credit card and with a Web interface PaaS: Platform as a Service is IaaS plus core software capabilities on which you build SaaS Cyberinfrastructure is “Research as a Service” Other Services Clients

Sensors as a Service Cell phones are important sensor Sensors as a Service Sensor Processing as a Service (MapReduce)

Clouds and Jobs Clouds are a major industry thrust with a growing fraction of IT expenditure that IDC estimates will grow to $44.2 billion direct investment in 2013 while 15% of IT investment in 2011 will be related to cloud systems with a 30% growth in public sector. Gartner also rates cloud computing high on list of critical emerging technologies with for example “Cloud Computing” and “Cloud Web Platforms” rated as transformational (their highest rating for impact) in the next 2-5 years. Correspondingly there is and will continue to be major opportunities for new jobs in cloud computing with a recent European study estimating there will be 2.4 million new cloud computing jobs in Europe alone by Cloud computing is an attractive for projects focusing on workforce development. Note that the recently signed “America Competes Act” calls out the importance of economic development in broader impact of NSF projects

Gartner 2009 Hype Curve Clouds, Web2.0 Service Oriented Architectures Transformational High Moderate Low Cloud Computing Cloud Web Platforms Media Tablet

Philosophy of Clouds and Grids Clouds are (by definition) commercially supported approach to large scale computing – So we should expect Clouds to replace Compute Grids – Current Grid technology involves “non-commercial” software solutions which are hard to evolve/sustain Public Clouds are broadly accessible resources like Amazon and Microsoft Azure – powerful but not easy to optimize and perhaps data trust/privacy issues Private Clouds run similar software and mechanisms but on “your own computers” Services still are correct architecture with either REST (Web 2.0) or Web Services Clusters still critical concept

Grids MPI and Clouds Grids are useful for managing distributed systems – Pioneered service model for Science – Developed importance of Workflow – Performance issues – communication latency – intrinsic to distributed systems – Can never run large differential equation based simulations or datamining Clouds can execute any job class that was good for Grids plus – More attractive due to platform plus elastic on-demand model – MapReduce easier to use than MPI for appropriate parallel jobs – Currently have performance limitations due to poor affinity (locality) for compute-compute (MPI) and Compute-data – These limitations are not “inevitable” and should gradually improve as in July Amazon Cluster announcement – Will probably never be best for most sophisticated parallel differential equation based simulations Classic Supercomputers (MPI Engines) run communication demanding differential equation based simulations – MapReduce and Clouds replaces MPI for other problems – Much more data processed today by MapReduce than MPI (Industry Informational Retrieval ~50 Petabytes per day)

Fault Tolerance and MapReduce MPI does “maps” followed by “communication” including “reduce” but does this iteratively There must (for most communication patterns of interest) be a strict synchronization at end of each communication phase – Thus if a process fails then everything grinds to a halt In MapReduce, all Map processes and all reduce processes are independent and stateless and read and write to disks – As 1 or 2 (reduce+map) iterations, no difficult synchronization issues Thus failures can easily be recovered by rerunning process without other jobs hanging around waiting Re-examine MPI fault tolerance in light of MapReduce – Twister will interpolate between MPI and MapReduce

Authentication and Authorization: Provide single sign in to both FutureGrid and Commercial Clouds linked by workflow Workflow: Support workflows that link job components between FutureGrid and Commercial Clouds. Trident from Microsoft Research is initial candidate Data Transport: Transport data between job components on FutureGrid and Commercial Clouds respecting custom storage patterns Program Library: Store Images and other Program material (basic FutureGrid facility) Blob: Basic storage concept similar to Azure Blob or Amazon S3 DPFS Data Parallel File System: Support of file systems like Google (MapReduce), HDFS (Hadoop) or Cosmos (dryad) with compute-data affinity optimized for data processing Table: Support of Table Data structures modeled on Apache Hbase/CouchDB or Amazon SimpleDB/Azure Table. There is “Big” and “Little” tables – generally NOSQL SQL: Relational Database Queues: Publish Subscribe based queuing system Worker Role: This concept is implicitly used in both Amazon and TeraGrid but was first introduced as a high level construct by Azure MapReduce: Support MapReduce Programming model including Hadoop on Linux, Dryad on Windows HPCS and Twister on Windows and Linux Software as a Service: This concept is shared between Clouds and Grids and can be supported without special attention Web Role: This is used in Azure to describe important link to user and can be supported in FutureGrid with a Portal framework

Amazon offers a lot!

Traditional File System? Typically a shared file system (Lustre, NFS …) used to support high performance computing Big advantages in flexible computing on shared data but doesn’t “bring computing to data” S Data S S S Compute Cluster C C C C C C C C C C C C C C C C Archive Storage Nodes

Data Parallel File System? No archival storage and computing brought to data C Data C C C C C C C C C C C C C C C File1 Block1 Block2 BlockN …… Breakup Replicate each block File1 Block1 Block2 BlockN …… Breakup Replicate each block

Important Platform Capability MapReduce Implementations (Hadoop – Java; Dryad – Windows) support: – Splitting of data – Passing the output of map functions to reduce functions – Sorting the inputs to the reduce function based on the intermediate keys – Quality of service Map(Key, Value) Reduce(Key, List ) Data Partitions Reduce Outputs A hash function maps the results of the map tasks to reduce tasks