Hadoop Ecosystem Overview

Presentation transcript:

Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future lectures Discuss potential use cases for each project

Topics HDFS MapReduce YARN Sqoop Flume NiFi Pig Hive Streaming HBase Accumulo Avro Parquet Mahout Oozie Storm ZooKeeper Spark SQL-on-Hadoop In-Memory Stores Cassandra Kafka Crunch Azkaban

HDFS Hadoop Distributed File System High-performance file system for storing data We’ve talked about this enough

Hadoop MapReduce High-performance fault-tolerant data processing system We’ve also talked about this enough

YARN Abstract framework for distributed application development Split functionality of JobTracker into two components: ResourceManager and ApplicationMaster TaskTracker becomes NodeManager Containers instead of map and reduce slots Configurable amount of memory per NodeManager RM manages global assignment of compute resources to applications AM manages application life cycle – tasked to negotiate resources from the RM and works with NM to execute and monitor tasks NodeManager executes containers
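The division of labor on this slide can be sketched as a toy model. This is not YARN's API: the class names mirror the slide's terms, and the first-fit placement policy, node names, and memory sizes are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    memory_mb: int  # configurable amount of memory offered by this node
    containers: list = field(default_factory=list)

    def free_mb(self):
        return self.memory_mb - sum(mb for _, mb in self.containers)


class ResourceManager:
    """Toy global scheduler: first node with enough free memory wins."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, mb):
        for node in self.nodes:
            if node.free_mb() >= mb:
                node.containers.append((app_id, mb))
                return node.name
        return None  # no capacity anywhere; a real AM would wait and retry


def application_master(rm, app_id, task_sizes_mb):
    """Toy AM: negotiates one container per task and records the placement."""
    return [rm.allocate(app_id, mb) for mb in task_sizes_mb]


nodes = [NodeManager("nm1", 4096), NodeManager("nm2", 2048)]
rm = ResourceManager(nodes)
placements = application_master(rm, "app_0001", [2048, 2048, 1024, 2048])
print(placements)
```

The last request gets no container because neither node has 2048 MB free, which is the kind of cluster-wide decision the RM owns while the AM only tracks its own application.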

MapReduce 2.x on YARN MapReduce API has not changed Binary-level backwards compatible (no recompile) Application Master launches and monitors job via YARN MapReduce History Server to store… history Enabled Yahoo! to scale beyond 4,000 nodes In YARN, a MapReduce application is equivalent to a job, executed by the MapReduce AM

Hadoop Ecosystem Core Technologies Hadoop Distributed File System Hadoop MapReduce Many other tools… Which we will be discussing… now

Apache Sqoop Apache project designed for efficient transfer of bulk data between Apache Hadoop and structured data stores Used through a CLI and extensible Use cases? Migrating off EDWs Pushing MapReduce output to EDW for application consumption or analytics

Apache Flume Distributed, reliable, available service for collecting, aggregating, and moving large amounts of log data Configure agents using simple files, extendable Use cases? Capturing log files from web servers Creating complex topologies for moving data Can support ‘events’ instead of just log data

Apache NiFi A service to reliably move and manipulate files between clusters using a web front-end Uses a GUI to drop processors and connect them to build workflows Use cases? Reliably push files between HDFS clusters

Apache Pig Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs Infrastructure compiles language to a sequence of MapReduce programs Use cases? Enabling MR analytics to those who don’t know Java Rapid prototyping of future Java analytics High-level languages improve over time

Apache Hive Data warehouse facilitating querying and managing large datasets Compiles SQL-like queries into MapReduce programs Use cases? Enabling MR analytics to those who don’t know Java Rapid prototyping of future Java analytics High-level languages improve over time
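To make "compiles SQL-like queries into MapReduce programs" concrete, here is a hand-written sketch of the map, shuffle, and reduce steps for a hypothetical query such as SELECT page, COUNT(*) FROM visits GROUP BY page. The table, column names, and function names are invented for illustration, and this is not the plan Hive actually generates.

```python
from collections import defaultdict

rows = [
    {"page": "/home", "user": "a"},
    {"page": "/cart", "user": "b"},
    {"page": "/home", "user": "c"},
]


def map_phase(row):
    # Emit (group key, 1), which is all GROUP BY page / COUNT(*) needs.
    yield row["page"], 1


def shuffle(pairs):
    # Group values by key; the framework does this between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    return key, sum(values)


pairs = [kv for row in rows for kv in map_phase(row)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)
```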

Hadoop Streaming Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer Just a jar file, not a real project Use cases? Enabling MR analytics to those who don’t know Java Rapid prototyping of future Java analytics High-level languages improve over time
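A minimal word-count sketch in the streaming style, assuming the usual convention of tab-separated key/value text lines. In a real job the mapper and reducer would be separate scripts reading standard input and writing standard output, passed to the streaming jar through its -mapper and -reducer options; here both are plain functions and the sort between them stands in for the framework's shuffle.

```python
import io
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input must arrive sorted by key."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for the framework: map, sort by key, reduce.
text = io.StringIO("to be or not\nto be\n")
output = list(reducer(sorted(mapper(text))))
print(output)
```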

Which high-level API is for you? Boils down to what you are comfortable with and what you are being told to use. All of them are highly extensible and

Apache HBase Distributed, scalable, big data store Data stored as sorted key/value pairs, with the key consisting of a row and column Use cases? Sub-second fetches of key/value pairs and fast small range scans CRUD operations Generally faster at reads than Accumulo
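Why sorted keys make small range scans fast can be shown with a toy in-memory store. This is not the HBase client API: the class, the "family:qualifier" column strings, and the sample rows are assumptions for illustration.

```python
import bisect


class SortedKV:
    """Toy store: keys are (row, column) tuples kept in sorted order."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        return self._values.get((row, column))

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row, in key order."""
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (stop_row,))
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedKV()
store.put("user2", "info:name", "Grace")
store.put("user1", "info:name", "Ada")
store.put("user1", "info:email", "ada@example.com")
store.put("user3", "info:name", "Edsger")
cells = list(store.scan("user1", "user3"))
print(cells)
```

Because the keys are sorted, a scan is two binary searches plus a contiguous read, regardless of insertion order.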

Apache Accumulo Robust, scalable, high-performance data storage and retrieval key/value store Cell-based access controls i.e. cell-level security Use cases? Sub-second fetches of key/value pairs and fast small range scans CRUD operations Cell-level security is built into the system Back-end iterators give automated operations on tables Generally faster at writes than HBase
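Cell-level security can be pictured as a filter applied to every cell during a scan. The sketch below is a simplification, not Accumulo's API: each cell carries a set of labels that the reader must hold in full, whereas real visibility expressions can also combine labels with OR. All names and data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    required_labels: frozenset  # reader must hold ALL of these labels


def scan(cells, authorizations):
    """Return only the cells whose labels the reader is authorized to see."""
    auths = frozenset(authorizations)
    return [c for c in cells if c.required_labels <= auths]


table = [
    Cell("patient1", "name", "J. Doe", frozenset()),
    Cell("patient1", "diagnosis", "flu", frozenset({"medical"})),
    Cell("patient1", "billing", "$120", frozenset({"finance", "admin"})),
]

nurse_view = [c.column for c in scan(table, {"medical"})]
clerk_view = [c.column for c in scan(table, {"finance"})]
print(nurse_view, clerk_view)
```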

Apache Avro Data serialization system for the Hadoop ecosystem Use cases? A unified data format across all Hadoop ecosystem projects Built for Hadoop

Apache Parquet Columnar storage format for Hadoop Use cases? A unified data format across all Hadoop ecosystem projects Built for Hadoop Columnar storage format can enhance how some projects access data for partition pruning (SQL on Hadoop projects, specifically)
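The benefit of a columnar layout can be seen by pivoting a few row records into per-column lists. This only illustrates the idea; it says nothing about Parquet's actual file format, and the records are made up.

```python
rows = [
    {"id": 1, "country": "US", "amount": 30.0},
    {"id": 2, "country": "DE", "amount": 12.5},
    {"id": 3, "country": "US", "amount": 7.5},
]


def to_columns(records):
    """Pivot row records into one list per column."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


columns = to_columns(rows)

# A query such as SUM(amount) touches one column and skips the rest.
total = sum(columns["amount"])
print(total)
```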

Apache Mahout Machine learning library to build scalable machine learning algorithms implemented on top of Hadoop MapReduce Use cases? Do cool and scalable analytics

Apache Oozie Workflow scheduler system to manage Apache Hadoop jobs Use cases? Can create complex application workflows and have them scheduled via Oozie
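A workflow scheduler has to run actions in dependency order. The sketch below uses Python's standard graphlib to order a hypothetical five-step pipeline; the action names are invented, and this is not how Oozie workflows are written or executed.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the actions it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "export": {"aggregate"},
    "report": {"aggregate"},
}

order = list(TopologicalSorter(workflow).static_order())
print(order)
```

The two final actions depend only on "aggregate", so a scheduler is free to run them in either order or in parallel.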

Apache Storm Distributed real-time computation system Didn’t have a logo until June 2014 How is this different than MapReduce? Use cases? Scales to hundreds of thousands of IOPs per node Perform functions on streaming data to get immediate value rather than using MR for batch processes
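The contrast with batch MapReduce can be sketched with a generator that updates its state on every event and emits a result immediately, instead of waiting for the whole input. This is plain Python, not Storm's API, and the event source is a hypothetical stand-in for an unbounded stream.

```python
from collections import Counter


def event_stream():
    # Hypothetical unbounded source, cut short here so the sketch terminates.
    yield from ["click", "view", "click", "purchase", "click"]


def running_counts(events):
    """Update state per event and emit the new count right away."""
    counts = Counter()
    for event in events:
        counts[event] += 1
        yield event, counts[event]


updates = list(running_counts(event_stream()))
print(updates)
```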

Apache ZooKeeper Effort to develop and maintain an open-source server enabling highly reliable distributed coordination Use cases? Solves a lot of problems you generally have when building distributed systems Can be used for configuration, group services, pretty much anything you want it to do. It’ll remember your birthday without being told by Facebook that it was your birthday. Really great guy.

Apache Spark Fast and general engine for large-scale data processing Write applications in Java, Scala, or Python Use cases? It can do in-memory MR, removing the costly “writing out to HDFS between each phase” Has libraries for streaming, SQL, machine learning, and graphs

SQL on Hadoop Apache Drill, Cloudera Impala, Facebook’s Presto, Hortonworks’s Hive Stinger, Pivotal HAWQ, etc. SQL-like or ANSI SQL compliant MPP execution engines using HDFS as a data store Use cases? Non use cases? No overhead with MR gives faster execution times Opens a new paradigm in working with large data sets Can run an MPP database as part of your Hadoop cluster, rather than using an expensive process to move data from HDFS to an EDW

Sample Architecture [Diagram; components shown: Website, Webserver, Sales, Call Center, Flume Agent, SQL, HDFS, MapReduce, Pig, HBase, Storm, Oozie]

Other Hadoop Projects We [maybe] won’t be covering these in detail in the class (either less common, mildly unrelated, or fairly new). Students can use them with permission

Redis, Memcached, etc. Open-source in-memory key/value stores Use cases? Very helpful in supplementing MapReduce analytics or otherwise giving you fast data lookups

Apache Cassandra NoSQL database for managing large amounts of structured, semi-structured, and unstructured data Support for clusters spanning multiple datacenters Unlike HBase and Accumulo, data is not stored on HDFS Use cases? Non use cases? More performant system than HBase and Accumulo since data is not stored on HDFS Built-in WAN replication

Apache Crunch Java framework for writing, testing, and running MapReduce pipelines with a simple API Same code executes as a local job, as a MapReduce job, or as a streaming Spark job Use cases? * *Not the real logo, but truly fantastic

Apache Kafka High-throughput distributed publish-subscribe message service Use cases? Have multiple producers publish messages to hundreds of ‘topics’ of interest Clients subscribe to these messages and pull them off the queue for consumption
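The publish-subscribe model can be sketched as an append-only log per topic, with each subscriber keeping its own read position. This is a toy, not the Kafka client API: the Broker class, its method names, and the topic and consumer names are assumptions for illustration.

```python
from collections import defaultdict


class Broker:
    """Toy broker: one append-only log per topic, one offset per consumer."""

    def __init__(self):
        self.logs = defaultdict(list)
        self.offsets = defaultdict(int)  # (consumer, topic) -> next offset

    def publish(self, topic, message):
        self.logs[topic].append(message)

    def poll(self, consumer, topic, max_messages=10):
        start = self.offsets[(consumer, topic)]
        batch = self.logs[topic][start:start + max_messages]
        self.offsets[(consumer, topic)] = start + len(batch)
        return batch


broker = Broker()
broker.publish("orders", "order-1")
broker.publish("orders", "order-2")
first = broker.poll("billing", "orders")
broker.publish("orders", "order-3")
second = broker.poll("billing", "orders")
other = broker.poll("shipping", "orders")
print(first, second, other)
```

Messages stay in the log after being read, so a second subscriber still sees all of them from its own starting offset.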

Azkaban Batch workflow job scheduler to run Hadoop jobs Use cases? Same as Oozie but apparently has a cool web UI

Review A lot of projects available to you for your group project Think of a problem you are interested in, then choose the appropriate projects to solve it Keep in mind data ingest, storage, processing, and egress Feel free to explore and use other projects than the ones I have listed here Get permission if you plan on using it as part of your project quota

References All those logos are the property of their owners *.apache.org redis.io