Presentation on theme: "Tanya Chaturvedi MBA(ISM) 500026401. Hadoop is a software framework for distributed processing of large datasets across large clusters of computers."— Presentation transcript:

1 Tanya Chaturvedi MBA(ISM) 500026401

2 Hadoop is a software framework for distributed processing of large datasets across large clusters of computers. Large datasets → terabytes or petabytes of data. Large clusters → hundreds or thousands of nodes. Hadoop is written in the Java programming language and requires Java Runtime Environment (JRE) 1.6 or higher. What is Hadoop?

3 • This technology was invented by Google back in their early days so they could usefully index all the textual and structural information they were collecting and then present meaningful results to the users. • Hadoop is based on a simple data model: any data will fit. Innovation

4 Hadoop Master/Slave Architecture Hadoop is designed as a master/slave architecture: one master node and many slave nodes.

5 • Need to process big data. • Need to parallelize computation across thousands of nodes. • Commodity hardware → a large number of low-end, cheap machines working in parallel to solve a computing problem, rather than a small number of high-end, expensive machines. • Fault tolerance and automatic recovery → nodes/tasks will fail and will recover automatically. Design Principles of Hadoop

6 • Google: inventors of the MapReduce computing paradigm. • Yahoo: index calculation for the Yahoo search engine. • IBM, Microsoft, Oracle, Apple, HP, Twitter. • Facebook, Amazon, AOL, Netflix. • Many other universities and research labs. Users of Hadoop

7 Main Reasons for Using Hadoop

8 Hadoop Architecture The Hadoop framework consists of two main layers: 1. Distributed file system (HDFS) 2. Execution engine (MapReduce)
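To make the MapReduce layer concrete, below is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce Java API. It is an illustration only: the class names and the whitespace tokenization are choices made here, not something taken from the slides.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: tokenize each input line and emit (word, 1).
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // emit (word, 1)
            }
        }
    }
}

// Reduce phase: sum the counts collected for each word.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));  // emit (word, total)
    }
}

The framework handles the distribution: HDFS splits the input across DataNodes, mappers run where the data lives, and the shuffle groups all counts for a word at one reducer.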

9 • A small Hadoop cluster will include a single master and multiple slave nodes. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes. • The JobTracker runs on the master node; it receives the user’s jobs. • Hadoop requires Java Runtime Environment (JRE) 1.6 or higher.
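As a sketch of how a user’s job actually reaches the cluster, here is a typical driver for the word-count classes above. Job.getInstance is the Hadoop 2.x form (older MRv1 code, where the JobTracker received the job, constructed the Job directly); the input and output paths come from the command line and are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // local pre-aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}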

10 • HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. • HDFS keeps multiple copies of data in different locations. • The goal of HDFS is to reduce the impact of power or switch failures, so that even if these occur, the data remains available. Hadoop Distributed File System
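As an illustration of how a client reads and writes HDFS, here is a minimal sketch using the org.apache.hadoop.fs.FileSystem API. The file path is hypothetical, and the NameNode address is assumed to come from the cluster configuration (fs.default.name / fs.defaultFS):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);  // connects to the configured NameNode

        Path file = new Path("/user/demo/hello.txt");  // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");  // blocks are replicated behind the scenes
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}

The client only names the file; the NameNode decides which DataNodes hold its blocks and replicas, which is how the failure of one location stays invisible to the reader.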

11 • Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system’s data. • Replication: each data block is replicated many times (default is 3). • Fault Tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. Properties of HDFS…
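A small sketch, reusing the hypothetical file from above, of inspecting a file’s replication factor and block placement through the same FileSystem API; the cluster-wide default of 3 replicas comes from the dfs.replication setting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/hello.txt");  // hypothetical file

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());

        // List which DataNodes hold each block of the file.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(block);  // offset, length, and hosts
        }

        // Per-file override of the cluster-wide dfs.replication default (3).
        fs.setReplication(file, (short) 5);
    }
}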

12 • Hadoop is a framework which provides both distributed storage and computational capabilities. • It is extremely scalable. • HDFS uses a large block size, which works best when manipulating large data sets. • HDFS maintains multiple replicas of files, making it fault tolerant. • Hadoop uses the MapReduce framework, a batch-based, distributed computing framework. Advantages of Using Hadoop

13 • Weak security model. • Inefficient for handling small files. • Does not offer storage- or network-level encryption. • Single-master model can result in a single point of failure. Limitations of Hadoop

14 Hadoop vs. Other Systems

