The Hadoop Ecosystem
Kai Voigt, Cloudera Inc.
Moscow, November 16th, 2011



2 Cloudera
                         Hadoop                                          Linux
Licence                  Apache                                          GPL and others
Distribution Vendor      Cloudera                                        Red Hat
Free Distribution        Cloudera's Distribution Including Hadoop (CDH)  Fedora Core
Commercial Distribution  Cloudera Enterprise                             Red Hat Enterprise Linux (RHEL)
©2011 Cloudera, Inc. All Rights Reserved.

3 Hadoop Core
HDFS
MapReduce

4 HDFS
Hadoop Distributed File System
Redundancy
Fault Tolerant
Self Healing
Write Once, Read Many Times
Java API
Command Line Tool

5 MapReduce
Two Phases of Functional Programming
Redundancy
Fault Tolerant
Self Healing
Java API
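
The "two phases of functional programming" on the MapReduce slide can be illustrated without a cluster. The sketch below is plain Python, not the Hadoop Java API; the word-count task and the names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are assumptions chosen for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit one (word, 1) pair per word in a line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step that sits between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values, 0)


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the calls could in principle run on different machines, which is the property a framework like Hadoop exploits.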

6 Hadoop Core
HDFS
MapReduce

7 HDFS-FUSE
(diagram: /mnt/hdfs/, HDFS-FUSE, HDFS)

8 HDFS-FUSE Examples
    $ mount
    ...
    fuse on /mnt/hdfs type fuse (rw,nosuid,nodev,user_id=0,group_id=0,default_permissions,allow_other)
    $ cp /boot/vmlinuz-* /mnt/hdfs/user/cloudera/
    $ hadoop fs -ls vmlinuz-*
    -rw-r--r-- 3 cloudera supergroup :14 /user/cloudera/vmlinuz el5

9 Sqoop
(diagram: RDBMS, Sqoop, HDFS)

10 Sqoop
Import & Export
ODBC, JDBC Data Sources
CSV Files in HDFS

11 Sqoop Examples
    $ sqoop import --connect jdbc:mysql://localhost/world --username root --table City
    ...
    $ hadoop fs -cat City/part-m
    ,Kabul,AFG,Kabol,
    ,Qandahar,AFG,Qandahar,
    ,Herat,AFG,Herat,
    ,Mazar-e-Sharif,AFG,Balkh,
    ,Amsterdam,NLD,Noord-Holland,

12 Hive
(diagram: MapReduce, Hive, SQL)

13 Hive
Data Warehouse System for Hadoop
Data Aggregation
Ad-Hoc Queries
SQL-like Language (HiveQL)
Developed at Facebook

14 Hive Examples
    CREATE TABLE newmovie (
      id INT, name STRING, year INT,
      numratings INT, avgrating FLOAT
    );

    -- Fill newmovie with one row per movie: rating count and average rating.
    INSERT OVERWRITE TABLE newmovie
    SELECT id, name, year, COUNT(1), AVG(rating)
    FROM movie JOIN movierating
      ON movie.id = movierating.movieid
    GROUP BY id, name, year;
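
To make the result of the Hive query concrete, the sketch below performs the same join, grouping and aggregation in plain Python. It is only an illustration of what the query computes, not of how Hive executes it. The sample rows and the function name `rate_movies` are hypothetical, since the slide does not show the table contents, and the sketch assumes movie ids are unique.

```python
from collections import defaultdict

# Hypothetical sample rows; the real tables are not shown on the slide.
movie = [
    {"id": 1, "name": "Alpha", "year": 2001},
    {"id": 2, "name": "Beta", "year": 2005},
]
movierating = [
    {"userid": 10, "movieid": 1, "rating": 4},
    {"userid": 11, "movieid": 1, "rating": 2},
    {"userid": 10, "movieid": 2, "rating": 5},
]


def rate_movies(movies, ratings):
    """Join on movie.id = movierating.movieid, then group and aggregate."""
    by_id = {m["id"]: m for m in movies}
    grouped = defaultdict(list)
    for r in ratings:
        m = by_id.get(r["movieid"])
        if m is not None:  # inner join: skip ratings with no matching movie
            grouped[(m["id"], m["name"], m["year"])].append(r["rating"])
    # One output row per group: id, name, year, COUNT, AVG
    return [
        (mid, name, year, len(vals), sum(vals) / len(vals))
        for (mid, name, year), vals in sorted(grouped.items())
    ]


if __name__ == "__main__":
    for row in rate_movies(movie, movierating):
        print(row)
```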

15 Pig
(diagram: MapReduce, Pig, Script)

16 Pig
Data Warehouse System for Hadoop
Data Aggregation
Ad-Hoc Queries
High-Level Scripting Language (Pig Latin)
Developed at Yahoo

17 Pig Examples
    -- Count and average the ratings per movie.
    movierating = LOAD 'movierating' AS (userid, movieid, rating:INT);
    groupmr = GROUP movierating BY movieid;
    ratings = FOREACH groupmr GENERATE group AS movieid,
              COUNT(movierating.rating) AS numratings,
              AVG(movierating.rating) AS avgrating;
    -- Join the aggregates with the movie data and store the result.
    movie = LOAD 'movie' AS (id, name, year);
    mr = JOIN movie BY id, ratings BY movieid;
    result = FOREACH mr GENERATE id, name, year, numratings, avgrating;
    STORE result INTO 'ratedmovie';

18 The Story So Far
(diagram: RDBMS, Hive, Pig, Sqoop, MapReduce, HDFS)

19 HBase
Low Latency
Random Reads And Writes
Distributed Key/Value Store
Simple API
 – PUT
 – GET
 – DELETE
 – SCAN

20 HBase Data Model
Key = (Row ID, Column Name, Timestamp)

Row ID            Column Name  Timestamp  Value
com.apple.www     Size         yesterday  1234
com.apple.www     Content      yesterday  ...
com.cloudera.www  Size         yesterday  2345
com.cloudera.www  Content      yesterday  ...
com.cloudera.www  Size         today      3456
com.cloudera.www  Content      today      ...
com.facebook.www  Size         yesterday  4567
com.facebook.www  Content      yesterday  ...
com.yahoo.www     Size         today      5678
com.yahoo.www     Content      today      ...
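
The data model on this slide, a value addressed by row ID, column name and timestamp, can be mimicked with a dictionary keyed by a tuple. This is a minimal sketch in plain Python, not the HBase client API. The integer timestamps and the helper names `get` and `scan` are assumptions for illustration, and only the Size cells from the slide are included.

```python
# Integer stand-ins for the slide's "yesterday" and "today" timestamps.
YESTERDAY, TODAY = 1, 2

# Each cell is addressed by (row ID, column name, timestamp).
cells = {
    ("com.apple.www", "Size", YESTERDAY): 1234,
    ("com.cloudera.www", "Size", YESTERDAY): 2345,
    ("com.cloudera.www", "Size", TODAY): 3456,
    ("com.facebook.www", "Size", YESTERDAY): 4567,
    ("com.yahoo.www", "Size", TODAY): 5678,
}


def get(table, row, column, timestamp=None):
    """Return the newest value at or before `timestamp` (newest overall if None)."""
    versions = [
        (ts, value)
        for (r, c, ts), value in table.items()
        if r == row and c == column and (timestamp is None or ts <= timestamp)
    ]
    return max(versions)[1] if versions else None


def scan(table, start_row, stop_row):
    """Yield cells in sorted key order for start_row <= row ID < stop_row."""
    for key in sorted(table):
        if start_row <= key[0] < stop_row:
            yield key, table[key]


if __name__ == "__main__":
    print(get(cells, "com.cloudera.www", "Size"))
    print(list(scan(cells, "com.c", "com.d")))
```

Keeping the timestamp in the key is what lets several versions of the same cell coexist, as with the two Size entries for com.cloudera.www.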

21 HBase Flow
(diagram: GET/PUT/DELETE, Memory, HDFS, Logfile)
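
The flow slide names three places a write can touch: a log file, memory, and HDFS. The toy class below sketches one plausible reading of that diagram: record each change in a log, apply it in memory, and periodically flush memory to an immutable file. It is a simplified assumption, not HBase's implementation; the class `TinyStore`, its `flush_threshold` parameter and the tombstone marker are all hypothetical.

```python
TOMBSTONE = object()  # marks a deleted key until older files are cleaned up


class TinyStore:
    """Toy write path: log first, then memory, with flushes to 'files'."""

    def __init__(self, flush_threshold=2):
        self.log = []      # stands in for the log file
        self.memory = {}   # stands in for the in-memory store
        self.files = []    # stands in for immutable files on HDFS
        self.flush_threshold = flush_threshold

    def _write(self, op, key, value):
        self.log.append((op, key))  # record the change before applying it
        self.memory[key] = value
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def put(self, key, value):
        self._write("PUT", key, value)

    def delete(self, key):
        self._write("DELETE", key, TOMBSTONE)

    def flush(self):
        """Persist memory as a new file; flushed edits no longer need the log."""
        self.files.append(dict(self.memory))
        self.memory.clear()
        self.log.clear()

    def get(self, key):
        """Check memory first, then files from newest to oldest."""
        for layer in [self.memory, *reversed(self.files)]:
            if key in layer:
                value = layer[key]
                return None if value is TOMBSTONE else value
        return None


if __name__ == "__main__":
    store = TinyStore()
    store.put("com.apple.www", 1234)
    print(store.get("com.apple.www"))
```

Reading from memory before the flushed files is what lets a newer write or delete hide an older value without rewriting the older file.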

22 Flume
Many Servers with many Log Files
 – Webserver
 – Mailserver
 – Syslog
Store all Logs in One Place
 – Manageable
 – Extensible
 – Reliable

23 Flume Architecture
(diagram: Log, Flume Node, Log, Flume Node, ..., HDFS)

24 Flume Sources and Sinks
Local Files
HDFS
Stdin, Stdout
Twitter
IRC
IMAP

25 Whirr
Automatic Cluster Setup in the Cloud
 – Amazon
 – Rackspace

26 Whirr Example
    $ cat hadoop.properties
    whirr.cluster-name=myhadoopcluster
    whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,7 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
    whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
    whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
    $ bin/whirr launch-cluster --config hadoop.properties
    $ . ~/.whirr/myhadoopcluster/hadoop-proxy.sh
    $ export HADOOP_CONF_DIR=~/.whirr/myhadoopcluster
    $ bin/whirr destroy-cluster --config hadoop.properties

27 Oozie Concept
crond for Hadoop
Job Flow Control
 – Branching
 – Serial
 – Loops
Triggered
 – Time
 – Data
(diagram: Job 1, Job 3, Job 2, Job 4, Job 5)
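
Job flow control with serial and branching steps amounts to running jobs in dependency order. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9 or later). It is not Oozie's own workflow definition format, and the dependencies between the five jobs are hypothetical, because the arrows of the slide's diagram are not recoverable from the transcript.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each job maps to the set of jobs it waits for.
workflow = {
    "Job 1": set(),
    "Job 2": {"Job 1"},           # branch after Job 1
    "Job 3": {"Job 1"},           # second branch after Job 1
    "Job 4": {"Job 2", "Job 3"},  # waits for both branches
    "Job 5": {"Job 4"},           # serial step
}


def run_order(deps):
    """Return batches of jobs; jobs within one batch do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the flow contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for batch in run_order(workflow):
        print(batch)
```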

28 Oozie Features
Component Independent
 – MapReduce
 – Hive
 – Pig
 – Sqoop
 – Streaming

29 Mahout
Machine Learning Library for Hadoop
 – Regression
 – Classification
 – Recommendations
 – Pattern Mining

30 Mahout Use Cases
Yahoo: Spam Detection
Foursquare: Recommendations
SpeedDate.com: Recommendations
Adobe: User Targeting
Amazon: Personalization Platform

31 CDH4u2
Cloudera's Distribution Including Hadoop
Linux Packages
 – Red Hat
 – Debian
 – Tar Archive
Virtual Machines
Cloud Installation with Whirr

32 CDH Components
Hadoop     Hive
Pig        HBase
Zookeeper  Flume
Sqoop      Whirr
Hue        Oozie
FUSE-DFS   Mahout

33 Thank you!
Kai Voigt