An Introduction to Sector/Sphere. Yunhong Gu, Univ. of Illinois at Chicago and VeryCloud. June 22, 2010.

What is Sector/Sphere?
- Sector: a distributed file system.
- Sphere: a simplified parallel data processing framework.
- Goal: handling big data on commodity clusters.
- Open source software, BSD license, written in C++.
- Started in 2006; the current version is 2.3.

Motivation: Data Locality
- Super-computer model: expensive, with a data I/O bottleneck.
- Sector/Sphere model: inexpensive, with parallel data I/O and data locality.

Motivation: Simplified Programming
- Parallel/distributed programming with MPI, etc.: flexible and powerful, but application development is very complicated.
- Sector/Sphere model (cloud model): the cluster appears as a single entity to the developer, with a simplified programming interface; limited to certain data-parallel applications.

Motivation: Global-scale System
- Systems built for a single data center require additional effort to locate and move data.
- The Sector/Sphere model supports wide-area data collection and distribution.

Sector Distributed File System
- A DFS designed to work on commodity hardware: racks of computers with internal hard disks and high-speed network connections.
- File-system-level fault tolerance via replication.
- Supports wide-area networks, so it can be used for data collection and distribution.
- Not POSIX-compatible yet.

Sector Distributed File System: Architecture
[Diagram: clients connect to the security server and master servers over SSL; the security server holds user accounts and data protection policies; masters handle metadata, scheduling, and system security; slaves provide data storage and processing; clients use system access tools and application programming interfaces; data moves over UDT, with optional encryption.]

Security Server
- Holds user accounts, permissions, and IP access control lists.
- Uses independent accounts, but can connect to an existing account database via a simple "driver", e.g., Linux accounts, LDAP, etc.
- There is a single security server; the system continues to run when the security server is down, but new users cannot log in.

Master Servers
- Maintain file system metadata. The metadata store is a customizable module; there are currently two implementations, one in-memory and one on-disk.
- Authenticate users, slaves, and other masters (via the security server).
- Maintain and manage file replication, data I/O, and data processing requests, with topology awareness.
- Multiple active masters can dynamically join and leave, with load balancing between masters.

Slave Nodes
- Store Sector files. A Sector file is not split into blocks; one Sector file is stored whole on the "native" file system (e.g., EXT, XFS, etc.) of one or more slave nodes.
- Process Sector data. Data is processed on the same storage node, or on the nearest possible storage node; input and output are Sector files.

Clients
- Sector file system client API: access Sector files in applications using the C++ API (a hypothetical sketch follows).
- Sector system tools: file system access tools.
- FUSE: mount the Sector file system as a local directory.
- Sphere programming API: develop parallel data processing applications that process Sector data with a small set of simple APIs.
- The client communicates with slaves directly for data I/O, via UDT.
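
For illustration, here is a minimal sketch of reading a Sector file through the C++ client API. This is a hypothetical sketch: the class and method names (Sector, SectorFile, init, login, open, read) follow the general shape of the Sector 2.x client interface, but the exact signatures, the master address, and the port are assumptions, not the verbatim API.

    #include <sector.h>  // Sector client header; location varies by install

    int main()
    {
        Sector client;
        client.init("master.example.com", 6000);  // assumed master host/port
        client.login("user", "password");         // placeholder credentials

        // Open an existing Sector file and read a chunk of it.
        SectorFile* f = client.createSectorFile();
        f->open("/demo/input.txt");
        char buf[4096];
        int n = f->read(buf, sizeof(buf));  // data arrives from a slave over UDT
        (void)n;

        f->close();
        client.releaseSectorFile(f);
        client.logout();
        client.close();
        return 0;
    }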

UDT: UDP-based Data Transfer
- An open-source, UDP-based data transfer protocol with reliability control and congestion control.
- Fast, firewall friendly, and easy to use.
- Already used in many commercial and research systems for large data transfers.
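
UDT deliberately mirrors the BSD socket API, so existing socket code ports over with few changes. A minimal UDT client sketch (the server address and port are placeholders) might look like this:

    #include <udt.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <cstring>

    int main()
    {
        UDT::startup();  // initialize the UDT library

        UDTSOCKET sock = UDT::socket(AF_INET, SOCK_STREAM, 0);

        sockaddr_in serv;
        memset(&serv, 0, sizeof(serv));
        serv.sin_family = AF_INET;
        serv.sin_port = htons(9000);                      // placeholder port
        inet_pton(AF_INET, "192.0.2.1", &serv.sin_addr);  // placeholder address

        if (UDT::ERROR == UDT::connect(sock, (sockaddr*)&serv, sizeof(serv)))
            return 1;  // inspect UDT::getlasterror() in real code

        const char* msg = "hello over UDT";
        UDT::send(sock, msg, (int)strlen(msg) + 1, 0);  // reliable, congestion-controlled

        UDT::close(sock);
        UDT::cleanup();
        return 0;
    }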

Application-aware File System
- Files are not split into blocks, so users are responsible for using properly sized files; since a file is the unit of storage and processing, a very large dataset is better uploaded as many moderately sized files than as one huge file.
- Directory and file family: Sector keeps related files together during upload and replication.
- In-memory objects.

Sphere: Simplified Data Processing
- For data-parallel applications.
- Data is processed where it resides, or on the nearest possible node (locality).
- The same user-defined function (UDF) is applied to all elements (records, blocks, files, or directories).
- Processing output can be written to Sector files or sent back to the client.
- Transparent load balancing and fault tolerance.

Sphere: Simplified Data Processing
The serial program:

    for each file F in (SDSS datasets)
        for each image I in F
            findBrownDwarf(I, ...);

becomes, in Sphere:

    SphereStream sdss;
    sdss.init("sdss files");
    SphereProcess* myproc;
    myproc->run(sdss, "findBrownDwarf", ...);

where the UDF has the signature:

    findBrownDwarf(char* image, int isize, char* result, int rsize);
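
To show what the UDF itself might contain, here is a hedged sketch that follows the simplified signature on this slide. The detection logic is a placeholder, and the real Sphere UDF interface may carry additional parameters beyond this simplified form.

    #include <cstdio>

    // Hypothetical UDF body: Sphere invokes this once per element
    // (here, one image buffer of isize bytes).
    int findBrownDwarf(char* image, int isize, char* result, int rsize)
    {
        // Placeholder analysis: count bright pixels as candidates.
        int candidates = 0;
        for (int i = 0; i < isize; ++i)
            if ((unsigned char)image[i] > 200)
                ++candidates;

        // Write a small text record into the caller-provided output buffer.
        snprintf(result, rsize, "candidates=%d\n", candidates);
        return 0;
    }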

Sphere: Data Movement
- Slave -> slave (local).
- Slave -> slaves (hash/buckets): each output record is assigned an ID, and all records with the same ID are sent to the same "bucket" file. A sketch of this idea follows.
- Slave -> client.
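
As an illustration of the hash/bucket step, a UDF can derive each record's ID from its key, so that all records sharing a key land in the same bucket file for the next stage. This helper is hypothetical, not part of the Sphere API:

    #include <functional>
    #include <string>

    // Hypothetical helper: map a record key to one of numBuckets bucket IDs.
    // Records with equal keys always get the same ID, so Sphere routes them
    // to the same bucket file.
    int bucketId(const std::string& key, int numBuckets)
    {
        return (int)(std::hash<std::string>{}(key) % (size_t)numBuckets);
    }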

What does a Sphere program look like?
- A client application: specifies the input, the output, and the name of the UDF. Inputs and outputs are usually Sector directories or collections of files. It may run multiple rounds of computation if necessary (iterative/combinative processing).
- A UDF: a C++ function following the Sphere specification (parameters and return value), compiled into a dynamic library.
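
Because the UDF is loaded as a dynamic library, building one is an ordinary shared-library compile; for example (the file and library names are hypothetical):

    g++ -shared -fPIC -o findBrownDwarf.so findBrownDwarf.cpp

The client then refers to the UDF by name ("findBrownDwarf"), as in the run() call shown earlier.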

Sphere/UDF vs. MapReduce
- Map = one UDF.
- MapReduce = two UDFs: the first UDF generates bucket files (like Map), and the second processes the bucket files (like Reduce).
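
To make the correspondence concrete, a word count in this style would use two UDFs. These signatures are illustrative only, following the simplified form shown earlier:

    // First UDF (plays the role of Map): parse a block of text and emit one
    // record per word; each record's bucket ID is derived from the word, so
    // all counts for a given word end up in the same bucket file.
    int wordcount_scan(char* block, int bsize, char* result, int rsize);

    // Second UDF (plays the role of Reduce): applied to each bucket file,
    // sums the per-word records that the first UDF routed there.
    int wordcount_sum(char* bucket, int bsize, char* result, int rsize);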

Sphere/UDF vs. MapReduce
Sphere is more flexible and efficient:
- A UDF can be applied directly to records, blocks, files, and even directories.
- Multiple inputs/outputs are supported with better data locality, including certain legacy applications that process files and directories.
- Native binary data is supported, with permanent index files.
- Sorting is required by Reduce, but is optional in Sphere.
- Output locality allows Sphere to combine multiple operations more efficiently.

Sphere Benchmarks
- Terasort: sort 1 TB of data over distributed servers.
- MalStone: detect malware websites from billions of transaction records.
- Graph processing: analyze very large social networks with billions of vertices (BFS and enumerating cliques).
- Genome pipeline: analyze genome sequences.
- Satellite image processing: compare satellite images taken at different times, for disaster relief.
In these benchmarks, Sphere is about 2 to 4 times faster than Hadoop.

Open Cloud Testbed
- 15 racks in Baltimore (JHU), Chicago (StarLight and UIC), and San Diego (Calit2).
- 10 Gb/s inter-site connections over CiscoWave.
- 1-2 Gb/s inter-rack connections.
- Nodes: two dual-core AMD CPUs, GB RAM, 1-4 TB RAID-0 disk.

Open Cloud Testbed

Development Status
- Current version 2.3: all core functions are ready; work continues on improving code quality and details of certain modules.
- Partly funded by NSF (NCDM/UIC).
- Commercial support via VeryCloud LLC.
- Next step: support column-based data tables (similar to BigTable).
- Open source contributors are welcome.

More Information
Sector website: