Understanding Hadoop Mr. Sriram


1 Understanding Hadoop Mr. Sriram

2 Objectives
What is Hadoop?
Characteristics of Hadoop
Why Hadoop?
History of Hadoop
Hadoop Design Principles
Hadoop Eco System
Apache Hadoop
Understand Hadoop 2.x core components
Perform Read and Write in Hadoop
Hadoop Distributed File System (HDFS)
Understand Rack Awareness concept
Network Topology in Hadoop
Case Studies of Hadoop
Hadoop Distributions

3 Understanding Hadoop
What is Hadoop?
Characteristics of Hadoop
Why Hadoop?
History of Hadoop
Hadoop design principles / advantages of Hadoop
Hadoop Eco System
Apache Hadoop
Understand Hadoop 2.x core components
Perform Read and Write in Hadoop
Hadoop Distributed File System (HDFS)
Understand Rack Awareness concept
Network Topology in Hadoop
Case Studies of Hadoop

4 What is Hadoop?
Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models.
It is open-source data management software with scale-out storage and distributed processing.
It is an open-source Java implementation that supplies a framework for developing highly scalable distributed computing applications.
It is a top-level Apache project, originally developed by Doug Cutting.
It was inspired by Google's MapReduce model for running distributed applications that process large amounts of data.
It can efficiently work on thousands of nodes (a distributed system) and petabytes of data.
Yahoo is the largest contributor to the project and uses it extensively.

5 Characteristics / Features of Hadoop
Reliable – data is stored on multiple machines
Economical – uses commodity hardware
Scalable – more machines can be added easily
Flexible – nodes can be added or removed easily

6 Characteristics / Features of Hadoop
Simple: Hadoop allows users to quickly write efficient parallel code.
Reliable: Data is stored on multiple machines.
Flexible: Nodes can be added or removed easily.
Economical: Uses commodity hardware.
Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
Robust: Can handle hardware failures because data is stored on multiple nodes.
Portable: It is written in Java, and HDFS (the Hadoop Distributed File System) is easily portable between platforms.
Latency: Emphasis on high throughput as opposed to low latency.
Bring code to data rather than data to code: Hadoop focuses on moving code rather than data, thereby increasing overall throughput.
Key/value pairs: Handles data in key/value format instead of using relational tables.

7 Why Hadoop?
Hadoop can analyse both structured and unstructured datasets.
It provides inexpensive and reliable storage.
The best solution to large-scale problems is to tie together many low-end machines as a single functional distributed system, and Hadoop does exactly that.
It exploits the underlying parallelism of CPU cores.
It scales out instead of scaling up.
It can handle hardware failures.

8 Why Hadoop?

9 History of Hadoop – In short
Dec 2004 – Google MapReduce paper published
July 2005 – Nutch uses MapReduce
Feb 2006 – Hadoop starts as a Lucene subproject
Apr 2007 – Yahoo! runs Hadoop on a 1,000-node cluster
Jan 2008 – Becomes an Apache top-level project
Feb 2008 – Yahoo! claims to run a 10,000-core cluster; Last.fm, Facebook, and the New York Times start using Hadoop
May 2009 – Hadoop sorts a petabyte in 17 hours

10 History of Hadoop

11 History of Hadoop (timeline figure: 2003, 2004, 2006)

12 History of Hadoop
2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project. The project was funded by Yahoo.
2006: Yahoo gave the project to the Apache Software Foundation.
Hadoop won the Terabyte Sort Benchmark (sorting 1 terabyte of data in 209 seconds, compared to the previous record of 297 seconds).
Avro and Chukwa became new members of the Hadoop framework family.
The HBase, Hive, and Pig subprojects were completed, adding more computational power to the Hadoop framework.
ZooKeeper was completed; Hadoop 1.0 and the Hadoop 2.0 alpha were released, and Ambari, Cassandra, and Mahout were added.
2014: Hadoop 2.0; Spark was added.

13 Google Vs Hadoop

14 Hadoop design principles / Advantages of Hadoop
Facilitates the storage and processing of large and/or rapidly growing data sets.
Handles structured and unstructured data with simple programming models, i.e., it supports all types of datasets.
Scale-out rather than scale-up: servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without interruption. During installation a configuration directory is created, and the master and slave machines are listed in it. When a machine is added and the cluster is restarted, the master identifies the new node/slave, and the balancer redistributes data based on data availability (see the sketch after this list).
Bring code to data rather than data to code.
Uses commodity hardware (cheap): the cost is very low because Hadoop is an open-source framework running on commodity hardware.
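A minimal sketch of adding a slave dynamically, assuming a Hadoop 2.x layout under $HADOOP_HOME and a hypothetical hostname slave4.example.com (file names and hosts will differ per cluster):
$ echo "slave4.example.com" >> $HADOOP_HOME/etc/hadoop/slaves   # register the new worker on the master
$ hdfs dfsadmin -refreshNodes                                    # ask the NameNode to re-read its host lists
$ yarn rmadmin -refreshNodes                                     # same for the ResourceManager
Once the new DataNode is running, the balancer (shown later) can spread existing blocks onto it.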

15 Hadoop design principles / Advantages of Hadoop
Fault tolerance: if one machine fails, another machine takes over and processing continues without interruption. The Hadoop library itself is designed to detect and handle failures at the application layer. DataNodes regularly send heartbeats and block reports to the NameNode. For high availability the NameNode role is split into an Active NameNode and a Standby NameNode.
Distributed processing: the Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, automatically distributing the data and work across the machines, and in turn it utilizes the underlying parallelism of the CPU cores. Files are split into blocks.
High scalability and availability.
Compatibility: another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java based.

16 Hadoop design principles / Advantages of Hadoop

17 Traditional RDBMS Vs Hadoop

18 Hadoop 1.0 & 2.0 Eco System

19 Hadoop Eco System
Hadoop is a system for large-scale data processing with two main layers: MapReduce and HDFS.
Data processing framework -> MapReduce
MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. MapReduce programs run on Hadoop, an Apache open-source framework. MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across large clusters of machines in a very reliable and fault-tolerant manner.
Hints: split a task across processors | process near the data and assemble the results | self-healing, high bandwidth | clustered storage | the JobTracker manages the TaskTrackers.
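As a concrete illustration, the WordCount example that ships with Hadoop can be run from the command line; the paths below reuse the /user/cloudera/Monday directory from the later HDFS command slides and are only illustrative:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/cloudera/Monday /user/cloudera/wc-output
$ hadoop fs -cat /user/cloudera/wc-output/part-r-00000    # each output line: word <tab> count
The map tasks run next to the data blocks, and the framework sorts and shuffles the intermediate key/value pairs before the reduce tasks aggregate them.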

20 Hadoop Eco System
Data storage framework -> Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. It is highly fault-tolerant, is designed to be deployed on low-cost hardware, provides high-throughput access to application data, and is suitable for applications with large datasets.
HDFS is a reliable, distributed, Java-based file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
Hints: distributed across "nodes" | natively redundant | the NameNode tracks block locations.
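To see the blocks and replica locations HDFS maintains for a file, the fsck utility can be used (a sketch; the path is illustrative):
$ hdfs fsck /user/cloudera/Monday -files -blocks -locations
The report lists each file, its blocks, and the DataNodes holding each replica, which makes the "natively redundant" behaviour visible.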

21 Hadoop Eco System
The data access frameworks are Pig, Hive, Sqoop, and Avro.
Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for processing these programs.
Hive – Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad hoc queries via an SQL-like interface for large datasets stored in HDFS.
Sqoop – A tool for efficiently transferring bulk data between Hadoop and structured data stores such as relational databases.
Spark – A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Avro – A data serialization system.

22 Hadoop Eco System
The orchestration frameworks are HBase, Chukwa, Flume, and ZooKeeper.
HBase – A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
Chukwa – A data collection system for managing large distributed systems.
Flume – A distributed, reliable, and available service for efficiently moving large amounts of data as it is produced. It is ideally suited to gathering logs from multiple systems and inserting them into HDFS as they are generated.
ZooKeeper – A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.

23 Apache Hadoop
Apache Hadoop is a top-level Apache project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
Everything is driven through the command-line interface (CLI); a few representative commands are shown below.
Vendors offering enterprise editions of Apache Hadoop include Cloudera, Hortonworks, and MapR.
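A sketch of some common CLI entry points:
$ hadoop version            # show the installed Hadoop version
$ hdfs dfsadmin -report     # summarize DataNodes, capacity, and usage
$ yarn application -list    # list running YARN applications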

24 Apache Hadoop
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Other Hadoop-related projects at Apache include:
Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.

25 Apache Hadoop
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
ZooKeeper™: A high-performance coordination service for distributed applications.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez is being adopted by Hive™, Pig™, and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.

26 Hadoop Stack

27 How does Hadoop work?
Hadoop runs across clusters of low-cost machines. The process includes the following core tasks that Hadoop performs:
Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB); see the block-size example after this list.
These files are then distributed across various cluster nodes for further processing.
HDFS, sitting on top of the local file system, supervises the processing.
Blocks are replicated to handle hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing debugging logs for each job.
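The block size is a per-file property that can be inspected and, if needed, overridden at write time; a sketch assuming a hypothetical local file big.log:
$ hdfs getconf -confKey dfs.blocksize                                                              # default block size in bytes (134217728 = 128 MB)
$ hadoop fs -D dfs.blocksize=67108864 -put /home/cloudera/Desktop/big.log /user/cloudera/Monday    # write this file with 64 MB blocks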

28 Hadoop 2.x Core Components

29 Hadoop 2.x Core Components

30 Hadoop 2.x Core Components

31 Performing Read & Write in Hadoop

32 Performing Read & Write in Hadoop

33 HDFS
HDFS Introduction
HDFS Advantages
HDFS Architecture
Communication Protocol
Daemons
Rack Awareness
Block Rebalance
HDFS Permissions & Security
Disadvantages

34 HDFS - Introduction
HDFS, the Hadoop Distributed File System, is a distributed file system that holds a large amount of data, terabytes or even petabytes.
It is a distributed, scalable, portable file system.
It is based on the design of GFS, the Google File System.
It is a block-structured file system where individual files are broken into blocks of a fixed size.
It runs in a separate namespace, isolated from the contents of your local files.
Files are stored in a redundant manner to ensure durability against failure.
Data is written to HDFS once and read several times. Updates to existing files in HDFS are not supported; appending new data to the end of a file is supported as an extension (see the example below).
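Appends (but not in-place edits) are exposed through the shell; a small sketch, assuming a hypothetical local file extra.txt and the one.txt file created in the later command slides:
$ hadoop fs -appendToFile /home/cloudera/Desktop/extra.txt /user/cloudera/Monday/one.txt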

35 HDFS - Advantages
Can hold large data sets (terabytes to petabytes) by distributing data among multiple machines.
Highly fault tolerant.
High throughput through parallel computing.
Streaming access to file system data.
Can be built out of low-cost hardware.
Processing logic is kept close to the data.
Reliability through automatically maintaining multiple copies of data.

36 HDFS & Other Parallel File System

37 HDFS - Architecture

38 HDFS - Architecture

39 HDFS – Communication Protocol
All HDFS communication protocols are layered on top of the TCP/IP protocol.
A client establishes a connection to a configurable TCP port on the NameNode machine, using the Client Protocol.
DataNodes talk to the NameNode using the DataNode Protocol.
The NameNode only responds to Remote Procedure Call (RPC) requests issued by DataNodes or clients.
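The NameNode endpoint that clients connect to is taken from configuration and can be checked from the shell (a sketch; the host name shown is hypothetical):
$ hdfs getconf -confKey fs.defaultFS     # e.g. hdfs://namenode.example.com:8020
$ hdfs getconf -namenodes                # list the configured NameNode host(s)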

40 Daemons
Daemons are resident programs which together constitute a running Hadoop installation.
A daemon is a simple interface for running a program under Hadoop.
The daemons include the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker.

41 HDFS Main Components
Name Node
Master of the system.
Maintains and manages the blocks which are present on the DataNodes.
Identifies the machine where storage needs to take place.
Keeps track of metadata.
Data Node
Slaves which are deployed on each machine and provide the actual storage of data.
Responsible for serving read and write requests from clients.
In a multi-node cluster, every slave node runs two daemons: the DataNode and the NodeManager.
The DataNode service is for HDFS and the NodeManager is for processing.

42 HDFS Main Components
Resource Manager
Identifies the machine where processing needs to take place.
Node Manager
Performs the actual execution.
Application Master
Handles failure/recovery and reports job status.
Monitors the application life cycle.
Secondary Name Node
Performs periodic checkpoints of the NameNode metadata (merging the edit log into the fsimage).
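On a running single-node installation, the daemons backing these components can be listed with the JDK's jps tool; a sketch of typical output (the process IDs are illustrative):
$ jps
2481 NameNode
2674 DataNode
2892 SecondaryNameNode
3120 ResourceManager
3315 NodeManager
3555 Jps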

43 Name Node Metadata
Metadata in memory
The entire metadata is kept in main memory.
There is no demand paging of file system metadata.
Types of metadata
List of files
List of blocks for each file
List of DataNodes for each block
File attributes, e.g., access time, replication factor
A transaction log
Records file creations, file deletions, etc.

44 Replication & Rack Awareness

45 Replication & Rack Awareness
Group all the machines across different racks.
Divide the data across the racks.
Store replica copies on other racks; in case of failure, the data can be served from another rack.
Data is replicated across the racks sequentially.
Note: if you move a file to HDFS, then by default 3 copies of the file are spread across different DataNodes (see the example below).
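The replication factor can be checked and changed per file from the shell; a sketch using the sample paths from the later command slides:
$ hadoop fs -ls /user/cloudera/Monday                      # the second column of each file entry is its replication factor
$ hadoop fs -setrep -w 2 /user/cloudera/Monday/one.txt     # lower this file to 2 replicas and wait until the change completes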

46 Replication & Rack Awareness
Racks can be considered as a set of rows where each row consists of a group of machines or nodes. Large Hadoop clusters are arranged in racks.
Communication between nodes in the same rack is faster (higher bandwidth) than between nodes spread across different racks.
Replicas of a block are placed on multiple racks for improved fault tolerance.
HDFS can be made rack-aware by the use of a network topology script. The master node uses the network topology script to map the network topology of the cluster. The network topology script receives IP addresses of machines as input and returns a list of rack names, one for each input.
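A minimal sketch of such a topology script, assuming the cluster's core-site.xml points the net.topology.script.file.name property at it and that the 10.1.1.x / 10.1.2.x subnets map to two hypothetical racks:
#!/bin/bash
# Print one rack name per input IP address or hostname.
for host in "$@"; do
  case "$host" in
    10.1.1.*) echo "/dc1/rack1" ;;
    10.1.2.*) echo "/dc1/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done
Any node the script cannot classify falls back to /default-rack, which is also what HDFS assumes when no script is configured.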

47 Block Rebalance
The goal is to distribute HDFS data uniformly (equally) across the DataNodes in a cluster.
Block replicas are spread across racks to ensure backups in case of rack failure.
In case of node failure, preference is given to replicas on the same rack so that cross-rack network I/O is reduced.
An automatic balancer tool, included as part of Hadoop, intelligently balances blocks across the nodes.
Perfect balancing is unlikely to be achieved; it is more desirable to run the balancing script when cluster utilization is at its minimum (see the example below).
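The balancing script mentioned above is the HDFS balancer; a sketch of invoking it during a quiet period:
$ hdfs balancer -threshold 10    # move blocks until every DataNode's usage is within 10% of the cluster average
It runs until the cluster is within the threshold or no more blocks can be moved, and it can be stopped safely at any time.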

48 Network Topology in Hadoop
The topology (arrangement) of the network affects the performance of the Hadoop cluster as the size of the cluster grows. In addition to performance, one also needs to care about high availability and the handling of failures. To achieve this, Hadoop cluster formation makes use of network topology.
Typically, network bandwidth is an important factor to consider when forming any network. However, as measuring bandwidth can be difficult, in Hadoop the network is represented as a tree, and the distance between nodes of this tree (the number of hops) is considered the important factor in forming the Hadoop cluster. Here, the distance between two nodes is equal to the sum of their distances to their closest common ancestor.
A Hadoop cluster consists of the data center, the rack, and the node which actually executes jobs. A data center consists of racks, and a rack consists of nodes. The network bandwidth available to processes varies depending on the location of the processes.
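As a worked example of the distance rule, write a node's position as /data-center/rack/node. Then distance(/d1/r1/n1, /d1/r1/n1) = 0 (same node), distance(/d1/r1/n1, /d1/r1/n2) = 2 (same rack), distance(/d1/r1/n1, /d1/r2/n3) = 4 (different racks in the same data center), and distance(/d1/r1/n1, /d2/r3/n4) = 6 (different data centers): each step up to the closest common ancestor adds one hop on each side.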

49 Network Topology in Hadoop..

50 Network Topology in Hadoop..
That is, the available bandwidth becomes progressively smaller for:
Processes on the same node
Different nodes on the same rack
Nodes on different racks of the same data center
Nodes in different data centers

51 HDFS Permissions and Security
Designed to prevent accidental corruption of data; it is not a strong security model that guarantees denial of access to unauthorized parties.
Each file or directory has three permissions: read, write, and execute.
Identity is not formally authenticated by HDFS; it is taken from an extrinsic source.
The username used to start the Hadoop process is considered the superuser for HDFS.
Permissions are enabled on HDFS by default.
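Permissions are manipulated with the familiar chmod/chown style commands; a sketch using the sample paths from the command slides:
$ hadoop fs -ls /user/cloudera/Monday                               # the first column shows the rwx permission bits
$ hadoop fs -chmod 750 /user/cloudera/Monday                        # owner rwx, group r-x, others none
$ hadoop fs -chown cloudera:cloudera /user/cloudera/Monday/one.txt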

52 HDFS Disadvantages
Provides streaming read performance at the expense of random seek times to arbitrary positions in files.
Does not support updating existing files, though future versions are expected to support appends.
Does not provide a mechanism for local caching of data.
Individual machines are expected to fail on a frequent basis.
The NameNode is a single point of failure for an HDFS cluster.

53 HDFS Terminal Commands – mkdir, touchz, ls, count
Terminal type: admin terminal # | user terminal $
To make a directory:
$ hadoop fs -mkdir /user/cloudera/Monday
To create an empty file:
$ hadoop fs -touchz /user/cloudera/Monday/one.txt
To list the files and directories present in an HDFS location:
$ hadoop fs -ls /user/cloudera/Monday
To count the number of files and directories available in an HDFS location:
$ hadoop fs -count /user/cloudera/Monday

54 HDFS Terminal Commands - Copy
To copy a file from the local file system (LFS) to HDFS:
$ hadoop fs -put /home/cloudera/Desktop/two.txt /user/cloudera/Monday
(or)
$ hadoop fs -copyFromLocal /home/cloudera/Desktop/three.txt /user/cloudera/Monday
To copy a file from HDFS to the LFS:
$ hadoop fs -get /user/cloudera/Monday/two.txt /home/cloudera/Desktop/Tuesday
(or)
$ hadoop fs -copyToLocal /user/cloudera/Monday/one.txt /home/cloudera/Desktop/Tuesday

55 HDFS Terminal Commands – cat, rm
To print the contents of an HDFS file:
$ hadoop fs -cat /user/cloudera/Monday/two.txt
(or)
$ hadoop fs -text /user/cloudera/Monday/two.txt
To remove a directory from an HDFS location:
$ hadoop fs -rm -r /user/cloudera/Monday

56 Case Studies of Hadoop Facebook
Facebook Insights provides developers and website owners with access to real-time analytics related to Facebook activity across websites with social plugins, Facebook Pages, and Facebook Ads. Using anonymized data, Facebook surfaces activity such as impressions, click-through rates, and website visits. These analytics can help everyone from businesses to bloggers gain insights into how people are interacting with their content so they can optimize their services. The general response of people towards a new product can be obtained by analyzing status updates. Every day more than 4 TB of data is added, so analyzing it is possible only with a powerful application like Hadoop.

57 Case Studies of Hadoop Retail (Walmart)
Retail giants like Walmart use billing information collected from their various stores to find the products that are in great demand, and also the products that are usually bought together (e.g., bread and jam). This data can then be used to stock those products together to ensure maximum visibility for the companion product.

58 Case Studies of Hadoop Email (Google)
Hadoop can be used effectively to filter spam mail. A sample set of spam mails is examined and the frequently occurring words are identified. The commonly occurring words in every mail that is sent are then checked against these "spam words" to decide whether the mail is spam or not.
Mobile Service Providers (T-Mobile, Lyca Mobile)
Every call made by a customer is logged, and the service provider can obtain valuable information from these log files: the offer that is most preferred by customers, and the type of offer each demographic uses, can be found out.

59 Hadoop Distributions
Apache Hadoop
Commercial Distributions:
Cloudera
Hortonworks
MapR Technologies
Amazon Web Services
Teradata
IBM InfoSphere
Intel

60 Apache Hadoop
A standard open-source Hadoop distribution (Apache Hadoop) includes:
The Hadoop MapReduce framework for running computations in parallel.
The Hadoop Distributed File System (HDFS).
Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling users' applications.
Hadoop Common – a set of libraries and utilities used by other Hadoop modules.
This is only a basic set of Hadoop components; there are other solutions available, such as Apache Hive, Apache Pig, and Apache ZooKeeper, that are widely used to solve specific tasks, speed up computations, optimize routine tasks, etc.

61 Commercial Hadoop Distributors
Cloudera
Hortonworks
MapR Technologies
Amazon Web Services
IBM InfoSphere
Teradata
Intel

62 Cloudera
It was founded in 2008.
It has more than 200 paying customers, some of whom boast deployments of more than 1,000 nodes supporting more than a petabyte of data.
Enterprise customers wanted a management and monitoring tool for Hadoop, so Cloudera built Cloudera Manager.
Enterprise customers wanted a faster SQL engine for Hadoop, so Cloudera built Impala using a massively parallel processing (MPP) architecture.

63 Hortonworks
It was established in 2011.
Of all the players, Hortonworks is closest to the Apache Hadoop open-source community with the Hortonworks Data Platform (HDP).
It pursues deep engineering partnerships with the likes of Microsoft, Teradata, SAP, and others (e.g., the Microsoft Azure cloud computing environment).
Apache Ambari, which provides a Hadoop cluster management console, is a key example of the cluster management tooling championed by Hortonworks.

64 MapR
It was established in 2011.
MapR Technologies is the third pure-play vendor on the list, but it lacks the market presence of Cloudera and Hortonworks.
Early on, it began focusing on enterprise features while most enterprises were still evaluating Hadoop at the proof-of-concept stage.
MapR Technologies has added some unique innovations to its Hadoop distribution, including support for the Network File System (NFS), running arbitrary code in the cluster, performance enhancements for HBase, as well as high-availability and disaster-recovery features.

65 Amazon Web Services - AWS
Amazon may not be the first thing that springs to mind when you think of Hadoop, but AWS' Elastic MapReduce (EMR) was one of the first commercial Hadoop offerings on the market and leads in global market presence.
AWS EMR is a managed Hadoop framework over EC2 and S3 where you can run your own Hadoop-related operations. S3 is an online file storage web service.
AWS' solution road map includes Amazon EMR integration with Amazon Kinesis for stream processing.

66 IBM InfoSphere
IBM doesn't have the depth in the Hadoop community that some of its competitors boast, but it has deep roots in distributed computing.
IBM BigInsights is the Hadoop software package offered by IBM.
IBM has more than 100 Hadoop deployments under its belt, some of which run to petabytes of data.
IBM's road map includes continuing to integrate the BigInsights Hadoop solution with related IBM assets like SPSS, BI tools, and data management and modelling tools.

67 Teradata
Teradata is a specialist in enterprise data warehouse (EDW) appliances.
It has built a strong technical partnership with Hortonworks to offer Hadoop as an appliance.
The vendor has deployed HCatalog, an open-source metadata framework developed by Hortonworks, and SQL-H, which allows analysis of HDFS data using industry-standard SQL.
Teradata currently has fewer than 100 customers for its Hadoop appliance.

68 Differences between the top three enterprise edition providers

69 Thank You !!!!!!!!!!!

