1 CS 525 Advanced Distributed Systems Spring 2011 Indranil Gupta (Indy) Old Wine: Stale or Vintage? April 14, 2011 All Slides © IG

2 Wisdom vs. Whipper-snapper Plenty of wisdom has evolved for distributed systems design over the decades. It is important to revisit these lessons every once in a while and test them.

3 A Comparison of Approaches to Large-Scale Data Analysis, A. Pavlo et al., SIGMOD 2009 (there is also a shorter paper in CACM, Jan 2010)

4 Databases vs. Clouds Basic Question: Why use MapReduce (MR), when parallel databases have been around for decades and have been successful? Written by experts in database research, including those who defined relational DBs many decades ago. Some in the databases community felt they might be hurt by these new-fangled cloud computing paradigms, which seemed to be merely reinventing the wheel.

5 Relational Database Consists of schemas –Employee schema, Company schema Schema = a table consisting of tuples –A tuple consists of multiple fields, including a primary key, foreign keys, etc. –SQL queries run across multiple schemas very efficiently
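To make this vocabulary concrete, here is a minimal sketch using Python's built-in sqlite3 module: the Employee and Company schemas above as tables, a primary key and a foreign key, and one SQL query that spans both. The column names and sample rows are illustrative assumptions, not from the slides.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two schemas; Employee.company_id is a foreign key into Company.
cur.execute("CREATE TABLE Company (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE Employee (
                   id INTEGER PRIMARY KEY,
                   name TEXT,
                   company_id INTEGER REFERENCES Company(id))""")

cur.executemany("INSERT INTO Company VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
cur.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                [(10, "Ada", 1), (11, "Bob", 2), (12, "Carol", 1)])

# One SQL query spanning both schemas: number of employees per company.
for row in cur.execute("""SELECT c.name, COUNT(*)
                          FROM Employee e JOIN Company c ON e.company_id = c.id
                          GROUP BY c.name"""):
    print(row)   # e.g. ('Acme', 2) and ('Globex', 1)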

6 Parallel DB is similar to MR Parallel DBs = parallelize query operations across multiple servers Parallel DBs: data processing consists of 3 phases –E.g., consider joining two tables T1 and T2 1. Filter T1 and T2 in parallel (~ Map) 2. Distribute the larger of T1 and T2 across multiple nodes, then broadcast the other Ti to all nodes (~ Shuffle) 3. Perform the join at each node, and store back the answers (~ Reduce) Parallel DBs used in the paper: DBMS-X and Vertica MR representative: Hadoop
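A toy, single-process Python sketch of these three phases may help make the analogy concrete: phase 1 filters both tables (the Map analogue), phase 2 hash-partitions the larger table and gives every partition a broadcast copy of the smaller one (the Shuffle analogue), and phase 3 joins locally at each simulated node (the Reduce analogue). The tables, keys, and two-node setup are made up for illustration.

from collections import defaultdict

T1 = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]   # (key, value) tuples; the larger table
T2 = [(2, "x"), (3, "y"), (5, "z")]             # the smaller table
N_NODES = 2

def map_filter(table, predicate):
    # Phase 1 (~Map): each table is filtered, conceptually in parallel.
    return [t for t in table if predicate(t)]

def shuffle(larger, smaller, n_nodes):
    # Phase 2 (~Shuffle): hash-partition the larger table across nodes;
    # every node also receives a broadcast copy of the smaller table.
    partitions = defaultdict(list)
    for key, val in larger:
        partitions[key % n_nodes].append((key, val))
    return [(partitions[n], smaller) for n in range(n_nodes)]

def join_at_node(partition, broadcast):
    # Phase 3 (~Reduce): join locally at each node.
    lookup = dict(broadcast)
    return [(k, v, lookup[k]) for k, v in partition if k in lookup]

t1 = map_filter(T1, lambda t: t[0] > 1)   # keep keys > 1 (arbitrary predicate)
t2 = map_filter(T2, lambda t: True)
results = []
for partition, small in shuffle(t1, t2, N_NODES):
    results.extend(join_at_node(partition, small))
print(results)   # [(2, 'b', 'x'), (3, 'c', 'y')]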

7 Advantages of Parallel DBs over MR Schema Support: more structured storage of data –Really needed? Indexing: e.g., B-trees –Really needed? What about BigTable? Programming model: more expressive –What about Hive and Pig Latin? Execution strategy: push data instead of pull (as in MR) –They argue it reduces bottlenecks Really? What about the network I/O bottleneck?

8 Load Times In general, load times are best for Hadoop –The DBs pay the cost of "structured schema" and "indexing" –Data is loaded onto nodes sequentially! (an implementation artifact?) –Are these operations really needed?

9 Actual Execution Hadoop always seems worst –Overhead rises linearly with the data stored per node Vertica is best –Aggressive compression –More effective as more data is stored per node

10 Other Tasks MR was never built for Join! –What about the MapReduce-Merge system? UDF does worst because of the row-by-row interaction between the DB and an input file that sits outside the DB –How common is this in parallel DBs?

11 What have we learnt? The load-time advantage for MR is in the 1000s of seconds, while its execution-time disadvantage is in the 10s of seconds –Means what? Pre-processing cuts down on processing time –E.g., selection task: Vertica and DBMS-X already use an index on the pageRank column –No reason why we can't do this preprocessing using MR! (see the sketch below) MapReduce is better matched to on-demand computations, while parallel DBs are better for repeated computations.
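As one illustration of that point, the selection task (keep rows whose pageRank exceeds a threshold) fits a map-only MapReduce job. Below is a sketch written as a Hadoop Streaming mapper in Python; the tab-separated input layout and the threshold are assumptions, not details taken from the paper. Run as a map-only job (zero reducers), its filtered output could be stored and reused by later jobs, which is exactly the kind of pre-processing the slide suggests.

#!/usr/bin/env python
# Hadoop Streaming mapper sketch: emit only records whose pageRank exceeds a
# threshold. Assumes one "url<TAB>pageRank" record per input line.
import sys

THRESHOLD = 10   # illustrative cutoff

for line in sys.stdin:
    try:
        url, rank = line.rstrip("\n").split("\t")
        if int(rank) > THRESHOLD:
            print(url + "\t" + rank)   # pass qualifying records through
    except ValueError:
        continue                        # skip malformed lines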

12 Discussion Points Performance vs. Complexity: does the study account for this tradeoff? –Which is more complex: RDBs or MR? "It is not clear how many MR users really need 1000 nodes." (page 2) –What do you think? –What about shared Hadoop services?

13 On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing, I. Foster et al., IPTPS 2003

14 Context Written by the "father" of Grid Computing (Ian Foster) Written in an era when p2p was a very active research area Grid computing was also very active The P2P and Grid communities were separate –Different researchers and conferences, with not much interaction between them Cloud computing had not yet emerged

15 Distributed Computing Resources: Wisconsin, MIT, NCSA

16 An Application Coded by a Physicist Four jobs: Job 0, Job 1, Job 2, Job 3 –Output files of Job 0 are input to Job 2 –Output files of Job 2 are input to Job 3 –Jobs 1 and 2 can be concurrent
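For a feel of the shape of such a workflow, here is a tiny Python sketch of the same DAG using concurrent.futures. Real Grid workflows would be expressed for a scheduler such as Condor rather than in application code, and whether Job 1 also consumes Job 0's output is not stated on the slide, so that dependency is an assumption here.

from concurrent.futures import ThreadPoolExecutor

def job(name, *inputs):
    # Stand-in for a real computation that reads input files and writes output files.
    print("running", name, "on", list(inputs))
    return "output-of-" + name

with ThreadPoolExecutor() as pool:
    out0 = pool.submit(job, "Job 0").result()        # Job 0 must finish first
    f1 = pool.submit(job, "Job 1", out0)             # Jobs 1 and 2 run concurrently
    f2 = pool.submit(job, "Job 2", out0)
    out2 = f2.result()                               # Job 3 needs Job 2's output
    out3 = pool.submit(job, "Job 3", out2).result()
    f1.result()                                      # also wait for Job 1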

17 Jobs 0–3 scheduled across the Wisconsin, MIT, and NCSA resources

18 Jobs 0–3 run across Wisconsin, MIT, and NCSA: the Condor protocol within a site, the Globus protocol across sites

19 The Grid, Recently Some are 40 Gbps links! (the TeraGrid links) "A parallel Internet"

20 Some Things Grid Researchers Consider Important Single sign-on: the collective job set should require once-only user authentication Mapping to local security mechanisms: some sites use Kerberos, others use Unix Delegation: credentials to access resources are inherited by subcomputations, e.g., from job 0 to job 1 (a toy sketch of delegation follows) Community authorization: e.g., third-party authentication For clouds, you additionally need to worry about failures, scale, the on-demand nature, and so on.
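To make the delegation idea concrete, here is a toy Python sketch assuming a made-up token format: the user authenticates once and issues a proxy token for Job 0, and Job 0 derives a narrower, shorter-lived token for Job 1 without the user re-authenticating. Real Grid single sign-on and delegation use X.509 proxy certificates (GSI); the stdlib HMAC here is only a stand-in for a certificate chain, and all names and fields are assumptions.

import hashlib, hmac, time

USER_SECRET = b"user-long-term-credential"   # stays with the user (single sign-on)

def sign(key: bytes, claims: str) -> str:
    # Stand-in for issuing a signed (proxy) credential.
    return hmac.new(key, claims.encode(), hashlib.sha256).hexdigest()

# The user authenticates once and issues a proxy token for Job 0.
job0_claims = "subject=job0;expires=%d" % (int(time.time()) + 3600)
job0_token = sign(USER_SECRET, job0_claims)

# Delegation: Job 0 derives a narrower, shorter-lived token for Job 1,
# so Job 1 can act on the user's behalf without ever seeing USER_SECRET.
job1_claims = "subject=job1;parent=job0;expires=%d" % (int(time.time()) + 600)
job1_token = sign(job0_token.encode(), job1_claims)

print("job0:", job0_token[:16], "job1:", job1_token[:16])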

21 Grid and P2P: "We must address scale & failure" / "We need infrastructure"

22 Definitions Grid: "Infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities" (1998); "A system that coordinates resources not subject to centralized control, using open, general-purpose protocols to deliver nontrivial QoS" (2002) P2P: "Applications that takes advantage of resources at the edges of the Internet" (2000); "Decentralized, self-organizing distributed systems, in which all or most communication is symmetric" (2002)

23 Definitions Grid: "Infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities" (1998); "A system that coordinates resources not subject to centralized control, using open, general-purpose protocols to deliver nontrivial QoS" (2002) P2P: "Applications that takes advantage of resources at the edges of the Internet" (2000); "Decentralized, self-organizing distributed systems, in which all or most communication is symmetric" (2002) 525's take on Grid: (good legal applications without intellectual fodder) 525's take on P2P: (clever designs without good, legal applications)

24 Grid versus P2P - Pick your favorite

25 Applications Grid: often complex & involving various combinations of –Data manipulation –Computation –Tele-instrumentation Wide range of computational models, e.g. –Embarrassingly parallel –Tightly coupled –Workflow Consequence –Complexity often inherent in the application itself P2P: some –File sharing –Number crunching –Content distribution –Measurements Legal Applications? Consequence –Low Complexity

27 Applications Grid: often complex & involving various combinations of –Data manipulation –Computation –Tele-instrumentation Wide range of computational models, e.g. –Embarrassingly parallel –Tightly coupled –Workflow Consequence –Complexity often inherent in the application itself P2P: some –File sharing –Number crunching –Content distribution –Measurements Legal Applications? Consequence –Low Complexity Clouds:

28 Scale and Failure Grid: moderate number of entities –10s of institutions, 1000s of users Large amounts of activity –4.5 TB/day (D0 experiment) Approaches to failure reflect assumptions –E.g., centralized components P2P: very large numbers of entities Moderate activity –E.g., 1-2 TB in Gnutella ('01) Diverse approaches to failure –Centralized (SETI) –Decentralized and self-stabilizing FastTrack 4,277,745; iMesh 1,398,532; eDonkey 500,289; DirectConnect 111,454; Blubster 100,266; FileNavigator 14,400; Ares 7,731 (2/19/'03)

30 Scale and Failure Grid: moderate number of entities –10s of institutions, 1000s of users Large amounts of activity –4.5 TB/day (D0 experiment) Approaches to failure reflect assumptions –E.g., centralized components P2P: very large numbers of entities Moderate activity –E.g., 1-2 TB in Gnutella ('01) Diverse approaches to failure –Centralized (SETI) –Decentralized and self-stabilizing FastTrack 4,277,745; iMesh 1,398,532; eDonkey 500,289; DirectConnect 111,454; Blubster 100,266; FileNavigator 14,400; Ares 7,731 (2/19/'03) Clouds:

31 Services and Infrastructure Grid: standard protocols (Global Grid Forum, etc.) De facto standard software (open source Globus Toolkit) Shared infrastructure (authentication, discovery, resource access, etc.) Consequences –Reusable services –Large developer & user communities –Interoperability & code reuse P2P: each application defines & deploys a completely independent "infrastructure" –JXTA, BOINC, XtremWeb? Efforts started to define common APIs, albeit with limited scope to date Consequences –New (albeit simple) install per application –Interoperability & code reuse not achieved

33 Services and Infrastructure Grid: standard protocols (Global Grid Forum, etc.) De facto standard software (open source Globus Toolkit) Shared infrastructure (authentication, discovery, resource access, etc.) Consequences –Reusable services –Large developer & user communities –Interoperability & code reuse P2P: each application defines & deploys a completely independent "infrastructure" –JXTA, BOINC, XtremWeb? Efforts started to define common APIs, albeit with limited scope to date Consequences –New (albeit simple) install per application –Interoperability & code reuse not achieved Clouds:

34 Coolness Factor – Grid vs. P2P

35 Coolness Factor – Grid vs. P2P Clouds:

36 Summary: Grid and P2P 1) Both are concerned with the same general problem –Resource sharing within virtual communities 2) Both take the same general approach –Creation of overlays that need not correspond in structure to underlying organizational structures 3) Each has made genuine technical advances, but in complementary directions –“Grid addresses infrastructure but not yet scale and failure” –“P2P addresses scale and failure but not yet infrastructure” 4) Complementary strengths and weaknesses => room for collaboration (Ian Foster at UChicago) Has this already happened? Are clouds where it has happened?

37 (Inspired by) 2 P2P or Not 2 P2P, M. Roussopoulos et al., IPTPS 2004

38 Should I build a P2P solution to…?

39 Challenges for you Can you build a similar decision tree for cloud problems? –A function of: the data sizes involved; the computation and I/O involved; the usage characteristics (streaming vs. batch) Can you extend this to a choice between parallel DBs and MR? (one starting sketch below)
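As a deliberately simplistic starting point for this challenge, here is a Python sketch of one such decision rule using the factors above plus query reuse; the thresholds and branches are assumptions meant to be argued with, not conclusions from either paper.

def choose_platform(data_tb, repeated_queries, structured, streaming):
    """Return a (platform, reason) suggestion; the rules are discussion fodder only."""
    if streaming:
        return ("neither; consider a streaming system", "batch MR and parallel DBs fit poorly")
    if structured and repeated_queries:
        return ("parallel DB", "indexing/compression cost is amortized over many runs")
    if data_tb > 100 or not repeated_queries:
        return ("MapReduce", "one-off scans; avoiding load/index time dominates")
    return ("either", "small, mixed workload; benchmark both")

print(choose_platform(data_tb=5, repeated_queries=True,
                      structured=True, streaming=False))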

40 (Inspired by) Scooped Again, J. Ledlie et al., IPTPS 2004

41 Killer apps! Computer Scientists invented the Internet. What is the killer app for the Internet? –The Web Who designed the Web? –Not Computer Scientists! (A physicist!) Computer Scientists invented p2p systems. What is the killer app for p2p? –File sharing Who designed the killer app? –Not Computer Scientists!

42 Can we build them? Wireless/ad-hoc networks –Killer app: social networks? Computer Scientists invented datacenters and cloud services… –The story continues? Sensor networks? Some opine that Computer Scientists have never been good at building killer applications. What do you think?