HADOOP ADMIN: Session -2

BIG DATA HADOOP ADMIN: Session -2
What is Hadoop?

2 AGENDA
Hadoop Demo using Cygwin
HDFS Daemons
Map Reduce Daemons
Hadoop Ecosystem Projects

3 Hadoop Using Cygwin
What is Cygwin?
Hadoop needs Java version 1.6 or higher
bin/hadoop
bin/hadoop jar hadoop-examples jar Word count input output
Word count example
Tokenization problem
Modifying the program

4 HDFS Daemons

Daemon               How many?   Purpose
Name Node            1           File metadata, block map
Secondary Name Node  1           Housekeeping, transaction log checkpointing
Data Node            Many        Block data (file contents)

Name Node: metadata is kept in RAM; it receives heartbeats and block reports from the Data Nodes.
Data Node: during startup each DataNode connects to the NameNode and performs a handshake; data blocks are read from the Data Nodes.
Secondary Name Node: not a backup or standby node. Checkpoint steps shown in the diagram: roll edits, copy fsimage and edits, replay all edits and create a new fsimage, send the new fsimage, rename new edits.
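
The checkpoint steps named on this slide (roll edits, copy fsimage and edits, replay all edits, send new fsimage) can be illustrated with a toy model. This is a minimal sketch in plain Python, not Hadoop code: the dict-based `namenode`, the `("create" | "delete", path, blocks)` edit format, and the function names are all assumptions made for illustration.

```python
# Toy model of a Secondary NameNode checkpoint (illustrative only, not Hadoop's API).
# Assumption: the fsimage is a dict of path -> block list, and the edits log is a
# list of (operation, path, blocks) tuples.

def replay_edits(fsimage, edits):
    """Return a new fsimage with every logged edit applied in order."""
    image = dict(fsimage)  # work on a copy so the caller's image is untouched
    for op, path, blocks in edits:
        if op == "create":
            image[path] = list(blocks)
        elif op == "delete":
            image.pop(path, None)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return image


def checkpoint(namenode):
    """Run one checkpoint cycle against a NameNode modelled as a dict."""
    # 1. Roll edits: the NameNode starts logging to a fresh, empty edits log.
    old_edits, namenode["edits"] = namenode["edits"], []
    # 2. Copy the fsimage (and the rolled edits) to the Secondary NameNode.
    copied_image = dict(namenode["fsimage"])
    # 3. Replay all edits and create a new fsimage.
    new_image = replay_edits(copied_image, old_edits)
    # 4. Send the new fsimage back; the NameNode adopts it.
    namenode["fsimage"] = new_image
    return new_image


namenode = {
    "fsimage": {"/a.txt": ["blk_1"]},
    "edits": [("create", "/b.txt", ["blk_2", "blk_3"]), ("delete", "/a.txt", None)],
}
checkpoint(namenode)
print(namenode["fsimage"])  # {'/b.txt': ['blk_2', 'blk_3']}
print(namenode["edits"])    # []
```

The point of the sketch is that the checkpoint only shortens the edits log by folding it into the image; it does not make the Secondary Name Node a standby that can take over.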

5 Map Reduce V1 Daemons
Job Tracker
Task Tracker

6 Word Count over a Given Set of Web Pages
Input pages: "see bob throw" and "see spot run"
Counts per page: (see 1, bob 1, throw 1) and (see 1, spot 1, run 1)
Combined counts: bob 1, run 1, see 2, spot 1, throw 1
Can we do word count in parallel?
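
One way to see why the answer is yes: each page can be counted independently, and the per-page results only need to be grouped by word afterwards. The following is a minimal sketch in plain Python using the two pages from this slide; the function names `map_page`, `shuffle`, and `reduce_counts` are illustrative and are not Hadoop APIs.

```python
from collections import defaultdict


def map_page(page):
    """Map step: emit a (word, 1) pair for every word on one page."""
    return [(word, 1) for word in page.split()]


def shuffle(pairs):
    """Group the emitted values by word so each word is reduced once."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum the partial counts for one word."""
    return word, sum(counts)


pages = ["see bob throw", "see spot run"]

# Each map_page call touches only its own page, so the calls could run in parallel.
mapped = [pair for page in pages for pair in map_page(page)]
result = dict(reduce_counts(w, c) for w, c in sorted(shuffle(mapped).items()))
print(result)  # {'bob': 1, 'run': 1, 'see': 2, 'spot': 1, 'throw': 1}
```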

7 The MapReduce Framework (pioneered by Google)

8 Automatic Parallel Execution in MapReduce (Google)
Handles failures automatically, e.g., restarts tasks if a node fails; runs multiple copies of the same task so that one slow task does not slow down the whole job

9 MapReduce in Hadoop (1)

10 MapReduce in Hadoop (2)

11 Data Flow in a MapReduce Program in Hadoop
InputFormat -> Map function -> Partitioner -> Sorting & Merging -> Combiner -> Shuffling -> Merging -> Reduce function -> OutputFormat
(One link in the diagram is labelled "1:many".)
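
The stages named on this slide can be walked through with a small simulation. This is a sketch in plain Python under stated assumptions, not Hadoop's implementation: the job is a word count, the partitioner is a CRC32 hash modulo the number of reducers, the combiner is a map-side sum, and names such as `run_map_task` and `run_job` are hypothetical.

```python
import zlib
from collections import defaultdict


def map_fn(line):
    """Map function: emit (word, 1) for each word in an input record."""
    for word in line.split():
        yield word, 1


def partition(key, num_reducers):
    """Partitioner: choose the reducer for a key (deterministic hash, assumed)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(sorted_pairs):
    """Combiner: sum values per key on the map side to shrink the shuffle."""
    totals = defaultdict(int)
    for key, value in sorted_pairs:
        totals[key] += value
    return sorted(totals.items())


def run_map_task(split, num_reducers):
    """One map task: map, partition, sort, then combine each partition."""
    buckets = [[] for _ in range(num_reducers)]
    for line in split:
        for key, value in map_fn(line):
            buckets[partition(key, num_reducers)].append((key, value))
    return [combine(sorted(bucket)) for bucket in buckets]


def run_job(splits, num_reducers=2):
    """Run all map tasks, then shuffle, merge, and reduce per partition."""
    map_outputs = [run_map_task(split, num_reducers) for split in splits]
    parts = []
    for r in range(num_reducers):
        merged = defaultdict(list)
        for output in map_outputs:  # shuffle: fetch partition r from every map task
            for key, value in output[r]:
                merged[key].append(value)
        # Reduce function: sum the merged values; one output "part" per reducer.
        parts.append([(key, sum(values)) for key, values in sorted(merged.items())])
    return parts


splits = [["see bob throw", "see bob"], ["see spot run"]]
for index, part in enumerate(run_job(splits)):
    print(f"part-{index}: {part}")
```

Because the partitioner depends only on the key, every occurrence of a word lands in the same partition, which is what lets each reducer produce a final count on its own.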

12 Lifecycle of a MapReduce Job
Map function
Reduce function
Run this program as a MapReduce job

14 Lifecycle of a MapReduce Job
Timeline (figure): Input Splits, then Map Wave 1, Map Wave 2, Reduce Wave 1, Reduce Wave 2.
It is recognized industry-wide that, to manage the complexity of today’s systems, we need to make systems self-managing. IBM’s autonomic computing, Microsoft’s DSI, and Intel’s proactive computing are some of the major efforts in this direction.
How are the number of splits, the number of map and reduce tasks, memory allocation to tasks, etc., determined?
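
As a rough illustration of where the "waves" in the timeline come from, the arithmetic below shows how a task count and a slot count yield a wave count. It is a simplified sketch: the file size, split size, and slot counts are assumed values, and real Hadoop split computation also depends on the InputFormat and on how the data is divided into files.

```python
import math


def num_splits(file_size_mb, split_size_mb):
    """Number of input splits for one file, assuming fixed-size splits."""
    return math.ceil(file_size_mb / split_size_mb)


def num_waves(num_tasks, num_slots):
    """Rounds needed when only num_slots tasks can run at the same time."""
    return math.ceil(num_tasks / num_slots)


# Assumed example values, chosen for illustration only.
file_size_mb = 1000
split_size_mb = 64
map_slots = 8
reduce_tasks = 10
reduce_slots = 5

splits = num_splits(file_size_mb, split_size_mb)   # one map task per split
map_waves = num_waves(splits, map_slots)
reduce_waves = num_waves(reduce_tasks, reduce_slots)

print(splits, map_waves, reduce_waves)  # 16 2 2
```

With these numbers the job runs two map waves and two reduce waves, matching the shape of the timeline on the slide.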

15 Job Configuration Parameters
190+ parameters in Hadoop
Set manually, or defaults are used

16 Hadoop Ecosystem/Sub Projects
Pig
HBase
Sqoop
Hive

17 PIG
One frequent complaint about MapReduce is that it is difficult to program.
A related criticism is that the development cycle is very long: to implement a program in MapReduce, you have to think at the level of mapper and reducer functions and job chaining.
Pig started as a research project within Yahoo! in the summer of 2006 and joined the Apache Incubator in September 2007. It was created at Yahoo! to make it easier for researchers and engineers to mine the huge datasets there.
Pig is a dataflow programming environment for processing very large files. Pig's language is called Pig Latin.
Pig is a Hadoop extension that simplifies Hadoop programming by giving you a high-level data processing language while keeping Hadoop’s simple scalability and reliability.
Yahoo! runs 40% of all its Hadoop jobs with Pig. Twitter also uses Pig.

18 PIG: How it looks
LOAD: loads a data file into a relation, with a defined schema
The name on the left-hand side is a relation, not a variable

19 Word count example in PIG
-- TextLoader loads each line as one column
text = LOAD 'text' USING TextLoader();
tokens = FOREACH text GENERATE FLATTEN(TOKENIZE($0)) AS word;
wordcount = FOREACH (GROUP tokens BY word) GENERATE group AS word, COUNT_STAR($1);
Figure: Pig job -> MR transformation -> MR jobs -> HDFS

20 PIG Vs Hive
Pig is a new language, easy to learn if you know languages similar to Perl.
Hive QL is a subset of SQL with very simple variations to enable map-reduce-like computation. So, if you come from a SQL background you will find Hive QL extremely easy to pick up (many of your SQL queries will run as is), while if you come from a procedural programming background (without SQL knowledge) then Pig will be much more suitable for you.
Hive is a bit easier to integrate with other systems and tools since it speaks the language they already speak (i.e. SQL).
Ultimately, the choice between Hive and Pig depends on the exact requirements of the application domain and the preferences of the implementers and those writing queries.

21 HIVE (HQL)
Hive is a data warehouse infrastructure built on top of Hadoop that compiles SQL queries into MR jobs and runs them on a Hadoop cluster.
Invented at Facebook to solve their own problems.
Provides a SQL-like query language (HQL/Hive QL) to retrieve and process the data.
JDBC/ODBC access is provided.
Currently also used with HBase.

22 HBase
HBase is not a high-level language that compiles to map-reduce; HBase is about allowing Hadoop to support lookups/transactions on key/value pairs.
HBase allows you to do quick random lookups instead of scanning all of the data sequentially, and to insert/update/delete in the middle of the data, not just add/append.

23 Sqoop
Loads bulk data into Hadoop from relational databases.
Imports individual tables or entire databases to files in HDFS.
Provides the ability to import from SQL databases straight into your Hive data warehouse.
Importing this table into HDFS could be done with the command:
sqoop --connect jdbc:mysql:// --table USERS \
    --local --hive-import

