Multi-Data-Center Hadoop in a Snap
Dr. Konstantin Boudnik, Vice President, Open Source Development

My background
● 15-year Sun Microsystems veteran: JVM, distributed systems
● Vice President, Apache Bigtop
● Committer, PMC & contributor to various ASF projects
● Member of Apache IPMC
● Early Hadoop committer

WANdisco Background
● WANdisco: Wide Area Network Distributed Computing
  – Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability
● Leader in tools for software engineers
  – Subversion
  – Apache Software Foundation sponsor
● Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND)
● US patent for active-active replication technology granted, November 2012
● Global locations
  – San Ramon (CA)
  – Chengdu (China)
  – Tokyo (Japan)
  – Boston (MA)
  – Sheffield (UK)
  – Belfast (UK)

Customers

Non-Stop Hadoop
● Non-Intrusive Plugin Provides Continuous Availability
● In the LAN / Across the WAN
● Active/Active

3 Key Problems For Multi Cluster Hadoop LAN / WAN

Enterprise Ready Hadoop: Characteristics of Mission Critical Applications
● Require 100% Uptime of Hadoop
  – SLAs, Regulatory Compliance
● Require HDFS to be Deployed Globally
  – Share Data Between Data Centers
  – Data is Consistent, Not Eventually Consistent
● Ease Administrative Burden
  – Reduce Operational Complexity
  – Simplify Disaster Recovery
  – Lower RTO/RPO
● Allow Maximum Utilization of Resource
  – Within the Data Center
  – Across Data Centers

Breaking Away from Active/Passive: What’s in a NameNode

Single Standby
● Inefficient utilization of resource
  – Journal Nodes
  – ZooKeeper Nodes
  – Standby Node
● Performance Bottleneck
● Still tied to the beeper
● Limited to LAN scope

Active / Active
● All resources utilized
  – Only NameNode configuration
  – Scale as the cluster grows
  – All NameNodes active
● Load balancing
● Set resiliency (# of active NN)
● Global Consistency
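The slide does not say how the active/active NameNodes stay consistent. As a purely illustrative sketch (an assumption, not WANdisco's implementation), the idea can be modeled as a replicated state machine. Any node may accept a write, a coordinator agrees on one global order, and every node applies the same operations in that order. The names `ToyNameNode`, `ToyCoordinator`, and `demo` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Hypothetical stand-in for a NameNode holding a copy of the namespace."""
    name: str
    namespace: dict = field(default_factory=dict)
    applied: int = 0  # how many agreed operations this node has applied

    def apply(self, op):
        kind, path = op
        if kind == "create":
            self.namespace[path] = True
        elif kind == "delete":
            self.namespace.pop(path, None)
        self.applied += 1


class ToyCoordinator:
    """Assumed stand-in for an agreement protocol: fixes one global order."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.log = []  # agreed sequence of (accepting node, operation)

    def propose(self, via, op):
        # Any node may accept the client request ("all NameNodes active");
        # the operation takes effect only once it has a slot in the log.
        self.log.append((via.name, op))
        for node in self.nodes:
            node.apply(op)


def demo():
    nodes = [ToyNameNode("nn-dc1"), ToyNameNode("nn-dc2"), ToyNameNode("nn-dc3")]
    coord = ToyCoordinator(nodes)
    coord.propose(nodes[0], ("create", "/logs/a"))
    coord.propose(nodes[2], ("create", "/logs/b"))
    coord.propose(nodes[1], ("delete", "/logs/a"))
    return nodes
```

Because every node applies the same log in the same order, all three copies of the namespace end up identical regardless of which node accepted each request. A real system would also have to handle node failures and network partitions, which this toy ignores.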

Breaking Away from Active/Passive: What’s in a Data Center

Standby Datacenter
● Idle Resource
  – Single Data Center Ingest
  – Disaster Recovery Only
● One way synchronization
  – DistCp
● Error Prone
  – Clusters can diverge over time
● Difficult to scale > 2 Data Centers
  – Complexity of sharing data increases

Active / Active
● DR Resource Available
  – Ingest at all Data Centers
  – Run Jobs in both Data Centers
● Replication is Multi-Directional
  – active/active
● Absolute Consistency
  – Single HDFS spans locations
● ‘N’ Data Center support
  – Global HDFS allows appropriate data to be shared
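To illustrate why one-way synchronization lets clusters diverge over time, here is a small toy model. It is an assumption-laden sketch that models each cluster as a dictionary; it does not invoke or reproduce DistCp. The names `one_way_copy`, `divergence`, and `demo` are hypothetical.

```python
def one_way_copy(source: dict, target: dict) -> None:
    """Toy scheduled copy: pushes every source entry to target, never the reverse."""
    target.update(source)


def divergence(a: dict, b: dict) -> set:
    """Paths that are missing on one side or hold different content."""
    return {p for p in a.keys() | b.keys() if a.get(p) != b.get(p)}


def demo():
    primary, standby = {}, {}

    primary["/data/day1"] = "v1"
    one_way_copy(primary, standby)           # first scheduled copy
    after_first = divergence(primary, standby)

    primary["/data/day2"] = "v1"             # new data after the copy ran
    primary["/data/day1"] = "v2"             # existing data modified
    standby["/data/fix"] = "v1"              # someone writes on the standby
    before_second = divergence(primary, standby)

    one_way_copy(primary, standby)           # second scheduled copy
    after_second = divergence(primary, standby)
    return after_first, before_second, after_second
```

Between copy runs the standby is stale, and anything written directly on the standby is never propagated back, so the two sides stay different even right after a copy completes.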

One Cluster Approach
● Example Applications
  – HBase
  – RT Query
  – MapReduce
● Poor Resource Management
  – Data Locality Issues
  – Network Use
  – Complex
Multiple Clusters

Creating Multiple Clusters
● Example Applications
  – HBase
  – RT Query
  – MapReduce
● Need to share data between clusters
  – DistCp / Stale Data
  – Inefficient use of storage and/or network
  – Some clusters may not be available
Multiple Clusters

Cluster Zones
● Zoning for Optimal Efficiency
● 100% HDFS Consistency

Multi Datacenter Hadoop Disaster Recovery
● WAN Replication
● Absolute Consistency
● Maximum Resource Use
● Lower Recovery Time/Point
● Replicate Only What You Want
● Better Utilization of Power/Cooling
● Lower TCO
● LAN Speed Performance

Architecture of a Non-Stop Hadoop

Technical Use Cases
● Eliminate Performance Bottleneck
  – HBase issues
● Multi Data-Center Ingest
  – Information doesn't need to be sent to one DC and then copied back to the other using DistCp
  – Parallel ingest methods don’t require redirected data streams
  – Ingest data at, or close to, the source
  – Global Analysis (Logs, Click Streams, etc.)
● Cluster Zones
  – Efficient use of resource based on application profile
  – HBase, MapReduce, Spark, etc.
● Maximize Data Center Resource Utilization
  – All datacenters can be used to run different jobs concurrently
● Disaster Recovery
  – Data is as current as possible (no periodic syncs)
  – Virtually zero downtime to recover from regional data center failure
  – Regulatory compliance
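The "no periodic syncs" point can be made concrete with a rough back-of-the-envelope model. The sketch below assumes a copy job starts every `sync_interval_hours`, takes `copy_duration_hours` to finish, and captures only data that existed when it started. These parameters, the function names, and the example figures are all hypothetical and do not come from the talk.

```python
def worst_case_lag_hours(sync_interval_hours: float, copy_duration_hours: float) -> float:
    """Worst-case staleness of the standby under the assumed model.

    Just before a copy completes, the standby only holds data from the start
    of the previous copy, which is one interval plus one copy duration old.
    """
    return sync_interval_hours + copy_duration_hours


def data_at_risk_gb(ingest_gb_per_hour: float,
                    sync_interval_hours: float,
                    copy_duration_hours: float) -> float:
    """Data not yet on the standby if the primary fails at the worst moment."""
    return ingest_gb_per_hour * worst_case_lag_hours(
        sync_interval_hours, copy_duration_hours
    )


# Hypothetical figures: 50 GB/hour ingest, copy every 6 hours, 1 hour per copy.
example_gb = data_at_risk_gb(50, 6, 1)  # 50 * (6 + 1) = 350
```

Under this model the exposure grows linearly with the sync interval, which is why replicating continuously rather than on a schedule lowers the recovery point.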

Non-Stop Hadoop Demonstration

Q & A

Thank you