Combining the Power of Hadoop with Object-Based Dispersed Storage
Copyright © 2012 Cleversafe, Inc. All rights reserved.

Presentation transcript:

Slide 1: Combining the Power of Hadoop with Object-Based Dispersed Storage

Slide 2: How Cleversafe's Dispersed Storage Works
1. Data is expanded, virtualized, transformed, sliced and dispersed using Information Dispersal Algorithms (IDAs).
2. Slices are distributed to separate disks, storage nodes and geographic locations.
3. Real-time, bit-perfect data is retrieved from a subset of slices.
Total slices = 'width' = N. Subset required to read = 'threshold' = K.
[Diagram labels: DATA; Cleversafe IDA; SITE 1; SITE 2; SITE 3; SITE 4; DATA]
Cleversafe Confidential Information
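The width/threshold idea can be illustrated with a toy dispersal scheme. The sketch below is not Cleversafe's IDA; it is a minimal polynomial code over the prime field GF(257), chosen only because it is short. The function names `disperse` and `reconstruct` and the choice of field are assumptions made for illustration. Each group of K bytes is treated as the coefficients of a polynomial, each of the N slices stores that polynomial evaluated at a distinct point, and any K slices are enough to solve for the coefficients again.

```python
# Toy information dispersal over GF(257). Illustrative only: this is NOT
# Cleversafe's IDA, and slice values may be 256, so they are kept as ints.
P = 257  # smallest prime above 255, so every byte value is a field element


def disperse(data, width, threshold):
    """Split data into `width` slices; any `threshold` of them can rebuild it."""
    padded = data + bytes(-len(data) % threshold)  # zero-pad to a multiple of K
    slices = [[] for _ in range(width)]
    for start in range(0, len(padded), threshold):
        coeffs = padded[start:start + threshold]  # K bytes = polynomial coefficients
        for i in range(width):
            x = i + 1  # slice i stores the polynomial evaluated at x = i + 1
            slices[i].append(sum(c * pow(x, j, P) for j, c in enumerate(coeffs)) % P)
    return slices


def _invert(matrix):
    """Invert a square matrix modulo P by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [row[:] + [int(r == c) for c in range(n)] for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] % P)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)  # modular inverse (Python 3.8+)
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(v - factor * p) % P for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def reconstruct(available, threshold, length):
    """Rebuild data from any `threshold` slices, given as {slice_index: values}."""
    chosen = sorted(available)[:threshold]
    vandermonde = [[pow(i + 1, j, P) for j in range(threshold)] for i in chosen]
    inverse = _invert(vandermonde)
    out = bytearray()
    for pos in range(len(available[chosen[0]])):
        ys = [available[i][pos] for i in chosen]
        for row in inverse:  # each row recovers one original byte of the group
            out.append(sum(a * y for a, y in zip(row, ys)) % P)
    return bytes(out[:length])
```

With width 5 and threshold 3, for example, any two slices can be lost and the data is still readable from the remaining three.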

Slide 3: Object-based Access Methods

Slide 4: How Hadoop Works
- Popular open-source MapReduce implementation, commercialized by Cloudera and others
- Take the computation to the data, not the data to the computation
[Diagram labels: Compute; Storage]

Slide 5: Hadoop MapReduce Challenges
Master-slave architecture: NameNode
- Point of failure: previously a single point of failure, now a clustered point of failure with HA
- Scalability bottleneck: the NameNode sits in the I/O path. NameNode federation helps, but introduces administrative headaches and increases the failure footprint
Efficiency: Replication
- Maintains 3 copies of data for protection. This is not a big deal in the terabyte range, but at petabyte and exabyte scale the management and overhead costs become prohibitive
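The storage cost of the two protection schemes can be compared with simple arithmetic. In the sketch below, 3x replication stores three full copies, while dispersal with width N and threshold K stores N/K times the usable data. The width of 16 and threshold of 10 are hypothetical values picked for the example; the slides do not state a configuration.

```python
def replicated_raw_tb(usable_tb, copies=3):
    """Raw capacity needed when every block is stored `copies` times."""
    return usable_tb * copies


def dispersed_raw_tb(usable_tb, width, threshold):
    """Raw capacity needed when data is expanded by width / threshold."""
    return usable_tb * width / threshold


# 1 PB (1000 TB) of usable data; width=16 and threshold=10 are assumed values.
usable = 1000
print(replicated_raw_tb(usable))         # 3000 TB of raw disk
print(dispersed_raw_tb(usable, 16, 10))  # 1600.0 TB of raw disk
```

Under these assumed parameters the dispersed layout needs roughly half the raw capacity of 3x replication, and the gap in absolute terms grows linearly with the amount of data.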

Slide 6: Combining computation and dispersed storage
- Hadoop MapReduce computation runs directly on dsNet Slicestors
- Jobs are assigned to stores for completely local data access
- Replace underlying HDFS with Dispersed Storage® while maintaining the HDFS interface to the MapReduce process
[Diagram labels: dsNet Slicestor; dsNet Storage; dsNet API; Hadoop MapReduce; Local data access]
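Assigning jobs to the stores that already hold the data is, in essence, locality-aware scheduling. The following is a minimal sketch of that idea, not the scheduler used by Hadoop or dsNet; the function `assign_tasks`, the store names and the slot counts are all hypothetical. Each chunk's task goes to a store that holds the chunk if one has a free slot, and falls back to any free store otherwise.

```python
def assign_tasks(chunk_locations, free_slots):
    """Assign one task per chunk, preferring stores that hold the chunk.

    chunk_locations: {chunk_id: [store names holding that chunk]}
    free_slots:      {store name: number of free task slots}
    Returns {chunk_id: (store, is_local)}.
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for chunk, stores in chunk_locations.items():
        local = [s for s in stores if slots.get(s, 0) > 0]
        # Fall back to any store with a free slot only when no local one exists.
        candidates = local or [s for s, n in slots.items() if n > 0]
        if not candidates:
            raise RuntimeError("no free task slots")
        chosen = max(candidates, key=lambda s: slots[s])  # least-loaded candidate
        slots[chosen] -= 1
        assignment[chunk] = (chosen, chosen in stores)
    return assignment


plan = assign_tasks(
    {"c1": ["s1"], "c2": ["s1"], "c3": ["s2"]},
    {"s1": 1, "s2": 2},
)
# c1 and c3 run where their data lives; c2 runs remotely because s1 is full.
```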

Slide 7: System Architecture
[Diagram labels: MASTER; Job Tracker; Log; SLAVES; ACCESSERS; Task Tracker; Maps; Reduces; Object Vaults; Metadata Vaults; Analytic Vaults]

Slide 8: New SliceStream™ Protocol
Concept:
- Manipulate input so that, after dispersal, raw data falls in contiguous chunks
- Read directly from raw slices, bypassing IDA reconstruction
  o Fall back to full IDA reconstruction if an error occurs
Result:
- Full reliability/availability of dispersal
- On a healthy dsNet, most reads for a MapReduce task can be satisfied locally
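The read path described here, a local fast path with a fallback to full reconstruction, can be sketched as follows. This is an illustration under stated assumptions rather than the SliceStream protocol itself: the slides do not say how errors are detected, so a CRC32 checksum stands in for that step, and `read_chunk` and its `reconstruct` callback are hypothetical names.

```python
import zlib


def read_chunk(local_slice, expected_crc, reconstruct):
    """Return (data, path_taken), preferring the locally stored raw slice.

    local_slice:  bytes held on this node, or None if the slice is missing
    expected_crc: CRC32 recorded at write time (an assumed integrity check)
    reconstruct:  zero-argument callable that performs full IDA reconstruction
    """
    if local_slice is not None and zlib.crc32(local_slice) == expected_crc:
        return local_slice, "local"  # fast path: no IDA decoding needed
    return reconstruct(), "reconstructed"  # slow path: rebuild from other slices


good = b"raw records for task 1"
crc = zlib.crc32(good)
rebuild = lambda: good  # stand-in for reading a threshold of remote slices

read_chunk(good, crc, rebuild)                 # served locally
read_chunk(b"raw recordz...", crc, rebuild)    # corrupted: falls back
read_chunk(None, crc, rebuild)                 # missing: falls back
```

The design point is that the fallback keeps the reliability of dispersal while the common case, a healthy local slice, avoids both network traffic and decoding work.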

Slide 9: Dispersal Pipeline for Hadoop
[Diagram labels: Raw data stream; Data Projection; Compute optimized data chunks; Write cache; Segmentation; Segmentation metadata & 1MB+ segments; IDA; Computationally useful slices; Slicestors]

Slide 10: HDFS Data Layout
[Diagram labels: Chunk 1; Write 1 (64MB * 3x); Chunk 1; Read for Task 1 (64MB); Dispersed Computing]

Slide 11: SliceStream™ Data Projection
[Diagram labels: Segment 1; Write 1 (1MB); Chunk 1; Read for Task 1 (64MB); Dispersed Computing]

Slide 12: Indexing & Hadoop
One bonus feature: Build & use Object Storage indexes from Hadoop jobs
Build indexes on data using Indexing APIs from MapReduce jobs
- Analyze and index data in parallel using index APIs
- Search and query your indexed data
Use indexes in MapReduce jobs to efficiently find the data you need to process
- Index data and metadata at ingest or later using MapReduce
- Query the index directly from MapReduce jobs to find the data you need to analyze
- Perform targeted analysis on only the relevant data
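The build-then-query pattern above can be sketched in plain Python. The slides do not show the Indexing APIs, so nothing below corresponds to a real Cleversafe or Hadoop call: `map_phase`, `reduce_phase`, `build_index` and `select_inputs` are hypothetical names, and a simple inverted index of words to object IDs stands in for whatever the product actually stores.

```python
from collections import defaultdict


def map_phase(object_id, text):
    """Map step: emit one (word, object_id) pair per distinct word."""
    for word in set(text.lower().split()):
        yield word, object_id


def reduce_phase(pairs):
    """Reduce step: group object IDs by word into an inverted index."""
    index = defaultdict(set)
    for word, object_id in pairs:
        index[word].add(object_id)
    return index


def build_index(objects):
    """Run the map and reduce steps over {object_id: text}."""
    pairs = (pair for oid, text in objects.items() for pair in map_phase(oid, text))
    return reduce_phase(pairs)


def select_inputs(index, term):
    """Query the index so a later job reads only the relevant objects."""
    return sorted(index.get(term.lower(), set()))


objects = {"a": "error disk full", "b": "ok", "c": "Disk ok"}
index = build_index(objects)
inputs = select_inputs(index, "disk")  # only objects "a" and "c" need processing
```

A follow-up job would then take `inputs` as its input set instead of scanning every object, which is the "targeted analysis" the slide refers to.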

Slide 13: Key Features and Benefits
Cost-effective scalability
- Infinite scalability in a single system
Increased performance and productivity
- Computation brought to the data
- dsNet Slicestors provide both computation and storage
- Geographic distribution enabled
Lower storage costs
- Information dispersal calls for one instance of the data vs. 3x with replication
Significantly higher reliability and availability
- Information dispersal eliminates single points of failure
- Continuous data availability with multiple simultaneous device or site failures
Drop-in replacement for existing MapReduce jobs via standard Hadoop File System interfaces