





2 Facts
• Data-intensive applications work with petabytes of data
• Web pages: 20+ billion web pages × 20 KB = 400+ terabytes
• One computer can read 30-35 MB/sec from disk: ~four months to read the web
• The same job on 1,000 machines: ~3 hours (checked in the sketch below)
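
As a quick back-of-the-envelope check of these figures, here is a minimal sketch in Java; the 32.5 MB/sec midpoint rate and perfect parallel scaling are assumptions, not part of the slide:

    public class ReadTimeEstimate {
        public static void main(String[] args) {
            double webBytes = 20e9 * 20e3;  // 20 billion pages x 20 KB = 4e14 bytes (400 TB)
            double diskRate = 32.5e6;       // assumed midpoint of the 30-35 MB/sec range

            double oneMachineDays = webBytes / diskRate / 86_400;
            double clusterHours   = webBytes / (diskRate * 1_000) / 3_600;

            System.out.printf("One machine: %.0f days%n", oneMachineDays);      // ~142 days, i.e. 4-5 months
            System.out.printf("1,000 machines: %.1f hours%n", clusterHours);    // ~3.4 hours
        }
    }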

3 Single-thread performance doesn't matter
• We have large problems, and total throughput/price matters more than peak performance
• Stuff breaks, so we need more reliability
• If you have one server, it may stay up three years (1,000 days)
• If you have 10,000 servers, expect to lose ten a day: 10,000 servers ÷ a 1,000-day average lifetime ≈ 10 failures per day
• "Ultra-reliable" hardware doesn't really help: at large scale, super-fancy reliable hardware still fails, albeit less often, so software still needs to be fault-tolerant, and commodity machines without fancy hardware give better performance/price

4

5 What is Hadoop?
• A framework for running applications on large clusters of commodity hardware, used both to store huge volumes of data and to process them
• Hadoop provides distributed processing of big data that is stored across many physical machines

6
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

7 Hadoop includes
• HDFS: a distributed filesystem
• Map/Reduce: a distributed, offline computing engine that implements this programming model on top of HDFS


9
• Hardware failure is the norm rather than the exception
• Moving computation is cheaper than moving data

10 HDFS
• Runs on commodity hardware
• Highly fault-tolerant and designed to be deployed on low-cost hardware
• Provides high-throughput access to application data
• Suitable for applications that have large data sets

11 NameNode and DataNodes
• HDFS has a master/slave architecture
• NameNode: manages the file system namespace and regulates access to files by clients
• DataNodes: usually one per node in the cluster; they manage storage attached to the nodes that they run on
• A file is split into one or more blocks

12
• These blocks are stored in a set of DataNodes
• The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories
• It also determines the mapping of blocks to DataNodes
• The DataNodes serve read and write requests from the file system's clients (a client-side sketch follows below)
• The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode
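
To make the client's view of this division of labor concrete, here is a minimal sketch using Hadoop's standard org.apache.hadoop.fs.FileSystem API; the file path and the class name HdfsClientSketch are illustrative only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster settings from core-site.xml / hdfs-site.xml
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/demo/hello.txt"); // illustrative path only

            // Write: the NameNode allocates blocks; the bytes stream to DataNodes
            FSDataOutputStream out = fs.create(file);
            out.writeUTF("Hello HDFS");
            out.close();

            // Read: the NameNode returns block locations; data comes from DataNodes
            FSDataInputStream in = fs.open(file);
            System.out.println(in.readUTF());
            in.close();

            // Namespace operation, handled entirely by the NameNode
            fs.rename(file, new Path("/demo/hello-renamed.txt"));
        }
    }

Note that the client never fetches file data from the NameNode itself; it only asks the NameNode where the blocks live and then talks to the DataNodes directly.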


15
• A software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner
• A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner
• The framework sorts the outputs of the maps, which are then input to the reduce tasks
• Typically the compute nodes and the storage nodes are the same

16
• The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node
• The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing the failed tasks
• The slaves execute the tasks as directed by the master

17
• Applications specify the input/output locations
• Applications supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes
• The Hadoop job client then submits the job and its configuration to the JobTracker
• The JobTracker assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information (see the driver sketch below)
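
As a concrete illustration of these steps, here is a minimal WordCount driver in the style of the old org.apache.hadoop.mapred API from the Hadoop 1.2.1 tutorial cited in the references; WordCount.Map and WordCount.Reduce refer to the mapper/reducer sketched after slide 18:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            // Key/value types emitted by the mapper and reducer
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            // Map/Reduce implementations (sketched after slide 18)
            conf.setMapperClass(WordCount.Map.class);
            conf.setCombinerClass(WordCount.Reduce.class); // optional local pre-aggregation
            conf.setReducerClass(WordCount.Reduce.class);

            // Input/output locations in HDFS, taken from the command line
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Submits the job to the JobTracker and waits for completion
            JobClient.runJob(conf);
        }
    }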

18 Let's simulate
• MapReduce operates exclusively on <key, value> pairs: the job takes a set of <key, value> pairs as input and produces a set of <key, value> pairs as output
• Process: consider a simple word-count example (a sketch of the mapper and reducer follows below)
• File 1: Hello World Bye World
• File 2: Hello Hadoop Goodbye Hadoop
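
A minimal sketch of the mapper and reducer for this example, closely following the WordCount v1.0 code in the Hadoop 1.2.1 MapReduce tutorial cited in the references:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

        // Mapper: for each word in the input line, emit <word, 1>
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    output.collect(word, one);
                }
            }
        }

        // Reducer (also usable as the combiner): sum the counts for each word
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }
    }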

19
• For the given sample input, the first map emits:
< Hello, 1> < World, 1> < Bye, 1> < World, 1>
• The second map emits:
< Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>

20 After using a Combiner
• A combiner performs a local reduce on each map's output before it is sent over the network
• The output of the first map:
< Bye, 1> < Hello, 1> < World, 2>
• The output of the second map:
< Goodbye, 1> < Hadoop, 2> < Hello, 1>

21
• Thus the output of the job is:
< Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>


23 References
• https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
• http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
• http://www.aosabook.org/en/hdfs.html
• http://hadoop.apache.org/

24 Thank You!


