Presentation transcript:

School of Computer Science and Mathematics
BIG DATA WITH HADOOP EXPERIMENTATION
APARICIO CARRANZA, NYC College of Technology – CUNY
HARRISON CARRANZA, School of Computer Science and Mathematics – Marist College
ECC Conference 2016 – Poughkeepsie, NY, June 2016

School of Computer Science and Mathematics
AGENDA
• WHAT IS AND WHY USE HADOOP?
• HISTORY AND WHO USES HADOOP
• KEY HADOOP TECHNOLOGIES
• HADOOP STRUCTURE
• RESEARCH AND IMPLEMENTATION
• CONCLUSIONS

School of Computer Science and Mathematics
WHAT IS AND WHY USE HADOOP?
• Open Source Apache Project
• Hadoop Core includes:
–Distributed File System – distributes data
–Map/Reduce – distributes application
• Written in Java
• Runs on Linux, Mac OS/X, Windows
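To make the two halves concrete, here is a minimal driver sketch using the standard org.apache.hadoop.mapreduce API. The WordCount.WordCountMapper and WordCount.WordCountReducer classes are a hypothetical word-count example (sketched later in this transcript), and the input/output paths are HDFS paths supplied on the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.WordCountMapper.class);    // hypothetical classes,
        job.setReducerClass(WordCount.WordCountReducer.class);  // sketched on a later slide
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input read from HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output written to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The driver only describes the job; the framework ships it to the cluster, where HDFS supplies the data and Map/Reduce distributes the computation.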

School of Computer Science and Mathematics
WHAT IS AND WHY USE HADOOP?
• How do you scale up applications?
–Run jobs processing hundreds of terabytes of data
–Takes 11 days to read on 1 computer
• Need lots of cheap computers
–Fixes the speed problem: 11 days ≈ 15,840 minutes, so splitting the read across 1,000 computers takes about 15 minutes
–Creates reliability problems:
In large clusters, computers fail every day
Cluster size is not fixed
Information on one computer gets lost
• Need common infrastructure
–Must be efficient and reliable

School of Computer Science and Mathematics
HISTORY AND WHO USES HADOOP
• Yahoo! became the primary contributor in 2006
• Its contribution has grown over the decade since, and adoption is expected to keep growing beyond 2018

School of Computer Science and Mathematics
HISTORY AND WHO USES HADOOP
• Yahoo! deployed large-scale science clusters in 2007
• Tons of Yahoo! Research papers emerge:
–WWW
–CIKM
–SIGIR
–VLDB
–…
• Yahoo! began running major production jobs in Q1 2008

School of Computer Science and Mathematics
HISTORY AND WHO USES HADOOP
• Amazon/A9
• AOL
• Facebook
• Fox Interactive Media
• Google
• IBM
• New York Times
• PowerSet (now Microsoft)
• Quantcast
• Rackspace/Mailtrust
• Veoh
• Yahoo!
• …and more

School of Computer Science and Mathematics
HISTORY AND WHO USES HADOOP
• When you visit Facebook, you are interacting with data processed with Hadoop!

School of Computer Science and Mathematics
HISTORY AND WHO USES HADOOP
• When you visit Facebook, you are interacting with data processed with Hadoop!
–Ads Optimization
–Content Optimization
–Search Index
–Content Feed Processing

School of Computer Science and Mathematics
HISTORY AND WHO USES HADOOP – NOWADAYS
• Facebook has ~20,000 machines running Hadoop
• The largest clusters are currently 2,000 nodes
• Several petabytes of user data (compressed, unreplicated)
• Facebook runs hundreds of thousands of jobs every month

School of Computer Science and Mathematics
KEY HADOOP TECHNOLOGIES
• Hadoop Core
–Distributed File System
–MapReduce Framework
• Pig (initiated by Yahoo!)
–Parallel programming language and runtime
• HBase (initiated by Powerset)
–Table storage for semi-structured data
• Zookeeper (initiated by Yahoo!)
–Coordinating distributed systems
• Hive (initiated by Facebook)
–SQL-like query language and metastore
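To give a feel for Hive's SQL-like query language, here is a minimal sketch of issuing a query from Java over JDBC. It assumes a running HiveServer2 endpoint; the URL, credentials, and the page_views table are illustrative placeholders only:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 endpoint; adjust host, port, and database.
        // Class.forName("org.apache.hive.jdbc.HiveDriver") may be needed with older drivers.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement();
             // "page_views" is a hypothetical table used only for illustration.
             ResultSet rs = stmt.executeQuery(
                     "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")) {
            while ( {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```

Behind the scenes, Hive compiles such a query into MapReduce jobs over files in HDFS, which is what makes it attractive for analysts who know SQL but not Java.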

School of Computer Science and Mathematics
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
• HDFS is designed to reliably store very large files across machines in a large cluster
• It is inspired by the Google File System
• HDFS stores each file as a sequence of blocks; all blocks in a file except the last are the same size
• Blocks belonging to a file are replicated for fault tolerance
• The block size and replication factor are configurable per file
• Files in HDFS are "write once" and have strictly one writer at any time
• Goals:
–Store large data sets
–Cope with hardware failure
–Emphasize streaming data access
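As a small illustration of the per-file block size and replication factor, a minimal sketch using the standard org.apache.hadoop.fs API; the path and the 128 MB / 3-replica values are arbitrary example choices:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/example.txt");  // placeholder path

        // Block size and replication factor are chosen per file at creation time.
        short replication = 3;                 // three copies for fault tolerance
        long blockSize = 128L * 1024 * 1024;   // 128 MB blocks
        try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
            out.writeUTF("hello HDFS");        // single writer; "write once"
        }
        // After close() the file contents are fixed, matching the write-once model.
    }
}
```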

School of Computer Science and Mathematics
HADOOP STRUCTURE
• Commodity hardware
–Linux PCs with 4 local disks
• Typically a two-level architecture
–40 nodes/rack
–Uplink from each rack is 8 gigabit
–Rack-internal is 1 gigabit all-to-all

School of Computer Science and Mathematics
HADOOP STRUCTURE
• Single namespace for the entire cluster
–Managed by a single namenode
–Files are single-writer and append-only
–Optimized for streaming reads of large files
• Files are broken into large blocks
–Typically 128 MB
–Replicated to several datanodes for reliability
• Client talks to both namenode and datanodes
–Data is not sent through the namenode
–Throughput of the file system scales nearly linearly with the number of nodes
• Access from Java, C, or the command line
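A minimal sketch of the read path from Java: the client asks the FileSystem (backed by the namenode) to open the file, then streams block data directly from the datanodes. The path below is the same placeholder used in the write sketch:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() consults the namenode for block locations;
        // the bytes themselves stream straight from the datanodes.
        try (FSDataInputStream in ="/user/demo/example.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

From the command line, the equivalent is `hdfs dfs -cat /user/demo/example.txt`.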

School of Computer Science and Mathematics
HADOOP STRUCTURE
• Java and C++ APIs
–The Java API works with objects, while the C++ API works with bytes
• Each task can process data sets larger than RAM
• Automatic re-execution on failure
–In a large cluster, some nodes are always slow or flaky
–The framework re-executes failed tasks
• Locality optimizations
–Map-Reduce queries HDFS for the locations of input data
–Map tasks are scheduled close to the inputs when possible
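As an illustration of the object-based Java API, a minimal sketch of the classic word count: the mapper emits (word, 1) pairs and the reducer sums them. These are the hypothetical classes referenced by the driver sketched earlier:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Mapper: (byte offset, line of text) -> (word, 1)
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reducer: (word, [1, 1, ...]) -> (word, total count)
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            ctx.write(word, new IntWritable(sum));
        }
    }
}
```

If a node running one of these tasks fails or stalls, the framework simply re-executes the task elsewhere, which is why the logic must be deterministic and side-effect free.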

School of Computer Science and Mathematics
RESEARCH & IMPLEMENTATION
• Three different multi-node clusters were produced, each utilizing one of:
–Microsoft Windows 7
–Linux Ubuntu
–Mac OS X Yosemite
• Hadoop functions were then compared in each of these three environments
• The data used in this experiment came from Wikipedia's data dump site

School of Computer Science and Mathematics
RESEARCH & IMPLEMENTATION
[Six image-only slides of experimental results; no transcript text available]

School of Computer Science and Mathematics
CONCLUSIONS
• When namenode memory usage is higher than 75%, reading and writing HDFS files through the Hadoop API will fail; this is caused by the Java HashMap
• The namenode's QPS (queries per second) can be optimized by fine-tuning the handler count to be higher than the number of tasks run in parallel
• The memory available affects the namenode's performance on each OS platform; therefore allocation of memory is essential
• Since memory has such an effect on performance, an operating system that uses less RAM for background and utility features yields better results
• Ubuntu Linux would be the ideal environment since it uses the least amount of memory, followed by Mac OS X and finally Windows
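The handler count referred to above is the namenode's pool of RPC handler threads, controlled by the standard dfs.namenode.handler.count property. A minimal hdfs-site.xml fragment is sketched below; the value 64 is only an illustrative choice:

```xml
<!-- hdfs-site.xml on the namenode: RPC handler threads. -->
<!-- The default is small (10); setting it above the number of tasks -->
<!-- run in parallel, as suggested above, raises the namenode's QPS headroom. -->
<property>
  <name>dfs.namenode.handler.count</name>
  <value>64</value>
</property>
```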

School of Computer Science and Mathematics THANK YOU!!!