Yunhong Gu and Robert Grossman, University of Illinois at Chicago. Presented by 王聖爵 (first-year M.S. student, Information Engineering), student ID 1098308103.

 Distributed data processing on commodity clusters can be made simple, given the right programming model.  MapReduce and Hadoop have focused on systems within a single data center.

 Sphere handles server heterogeneity, load balancing, and fault tolerance transparently to developers.  Unlike MapReduce or Hadoop, Sphere supports distributed data processing on a global scale.

 Clusters of commodity workstations and high-performance networks are ubiquitous.  Scientific instruments routinely produce terabytes or even petabytes of data every year.

 The most well-known cloud computing system is Google's GFS/MapReduce/BigTable stack and its open-source implementation, Hadoop.  The approach taken by cloud computing is to provide a very simple distributed programming interface by limiting the types of operations supported.

 All of these systems are set up on racks of clusters within a single data center.  Some requirements cannot be met this way, e.g., multinational collaborative projects, particle-collision data, genomic computation, and other large-scale scientific collaborations.

 Applications using the Sphere client API do not need to locate and move data explicitly, nor do they need to locate computing resources.  Sphere uses a stream processing paradigm to process large datasets.

Before Sphere:

for (int i = 0; i < n; ++i)
    process(data[i]);

With Sphere:

Sphere.run(data, process);

 The majority of the processing time for many data-intensive applications is spent in loops like these.  Developers typically spend much of their time parallelizing these types of loops (e.g., with PVM or MPI).

 Sphere runs on top of a distributed file system called Sector.  Sector provides functionality similar to that of a distributed file system (Google's GFS ↔ Sector).

 The security server maintains user accounts, passwords, and the privileges on each file or directory.

 The master server maintains the metadata of the files stored in the system, controls the running of all slaves, and responds to users' requests.  The master communicates with the security server to verify the slaves and the clients/users.

 The slaves are the nodes that actually store files and process the data upon request.  The slaves are usually racks of computers located in one or more data centers.

 1 billion astronomical images  The average size of an image is 1 MB, so the total data size is about 1 PB (10⁹ × 1 MB)  The SDSS dataset is stored in n files, named SDSS1.dat, …, SDSSn.dat  The record indexes are named by adding an ".idx" suffix: SDSS1.dat.idx, …, SDSSn.dat.idx  The processing function is "findBrownDwarf"

A standard serial program might look like this:

for each file F in (SDSS datasets)
    for each image I in F
        findBrownDwarf(I, …);

The same computation with Sphere:

SphereStream sdss;
sdss.init("sdss files");
SphereProcess myproc;
myproc->run(sdss, "findBrownDwarf");
myproc->read(result);

 Each of the 10 machines has an AMD Opteron 2.4 GHz or 3.0 GHz CPU, 2-4 GB RAM, and a TB disk; the machines are connected by 10 Gb/s wide-area networks:  2 in Chicago, IL  4 in Greenbelt, MD  2 in Pasadena, CA  2 in Tokyo, Japan

 Sphere handles server heterogeneity, load balancing, and fault tolerance transparently to developers.