3 Hadoop Using Cygwin
What is Cygwin? Hadoop needs Java version 1.6 or higher.
Jobs are launched with the bin/hadoop script, e.g.:
bin/hadoop jar hadoop-examples.jar wordcount input output
Word count example, the tokenization problem, modifying the program
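As a rough sketch of that example (class and field names are illustrative, not the exact shipped code): the stock WordCount mapper tokenizes each line with StringTokenizer, which splits only on whitespace, so punctuation sticks to the words; "modifying the program" usually means cleaning each token before emitting it.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the WordCount map function: one input line in, (word, 1) pairs out.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // StringTokenizer splits on whitespace only, so "word," and "word" count separately.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // A common modification: lower-case and strip punctuation before emitting.
            String token = itr.nextToken().toLowerCase().replaceAll("[^a-z0-9]", "");
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}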
4 HDFS Daemons
Daemon                 How many?   Purpose
Name Node              1           File metadata, block map
Secondary Name Node    1           Housekeeping, transaction-log checkpointing
Data Node              Many        Block data (file contents)
Name Node: keeps the metadata in RAM, receives heartbeats and block reports from the Data Nodes; clients get file metadata from it and then read data blocks directly from the Data Nodes.
Secondary Name Node: periodically copies the fsimage and edits from the Name Node, asks it to roll the edits log, replays all edits to create a new fsimage, and sends the new fsimage back. It is not a backup node or standby node.
Data Node: stores block data (file contents). During startup each DataNode connects to the NameNode and performs a handshake.
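As a small illustration of the read path described above (the file path is hypothetical): a client uses the Hadoop FileSystem API, which fetches block locations from the NameNode and then streams the bytes from the DataNodes that hold the blocks.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Reads a file from HDFS: metadata (block locations) comes from the NameNode,
// the actual bytes are streamed from the DataNodes.
public class HdfsRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/user/demo/input/part-00000")); // illustrative path
        IOUtils.copyBytes(in, System.out, 4096, true); // true = close the stream when done
    }
}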
8 Automatic Parallel Execution in MapReduce (Google)
Handles failures automatically, e.g., restarts tasks if a node fails; runs multiple copies of the same task (speculative execution) to avoid a slow task slowing down the whole job.
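In Hadoop this "multiple copies" mechanism is called speculative execution and can be turned on or off per job; a minimal sketch, assuming the Hadoop 1.x property names:

import org.apache.hadoop.conf.Configuration;

// Speculative execution: Hadoop launches duplicate attempts of tasks that look
// slow; the first attempt to finish wins and the others are killed. It is on
// by default (Hadoop 1.x property names shown; treat them as an assumption
// for your Hadoop version).
public class SpeculativeExecutionConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        return conf;
    }
}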
11 Data Flow in a MapReduce Program in Hadoop
InputFormat -> Map function (1:many, one input record can produce many output records) -> Partitioner -> Sorting & Merging -> Combiner -> Shuffling -> Merging -> Reduce function -> OutputFormat
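A driver that wires these stages together might look like the sketch below; WordCountMapper is the mapper sketched earlier, WordCountReducer (shown a little further down) doubles as the combiner, and the partitioner, input format, and output format are the stock Hadoop ones. Sorting, merging, and shuffling between map and reduce are handled by the framework.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

// Wires together the pluggable stages of the data flow shown above.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);

        job.setInputFormatClass(TextInputFormat.class);   // InputFormat: splits input, produces (offset, line)
        job.setMapperClass(WordCountMapper.class);        // Map function: (offset, line) -> (word, 1)
        job.setCombinerClass(WordCountReducer.class);     // Combiner: local pre-aggregation on the map side
        job.setPartitionerClass(HashPartitioner.class);   // Partitioner: decides which reducer gets each key
        job.setReducerClass(WordCountReducer.class);      // Reduce function: (word, [1,1,...]) -> (word, count)
        job.setOutputFormatClass(TextOutputFormat.class); // OutputFormat: writes reducer output to HDFS

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}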
12 Lifecycle of a MapReduce Job
Map function, Reduce function: run this program as a MapReduce job.
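A matching reduce function for the mapper sketched earlier (names again illustrative); together with that mapper and the driver shown above, this is what gets run as a MapReduce job.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the WordCount reduce function: sums the 1s emitted for each word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}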
14 Lifecycle of a MapReduce Job
(Figure: timeline of the job; the input splits are processed in map waves 1 and 2, followed by reduce waves 1 and 2.)
Industry-wide it is recognized that to manage the complexity of today's systems, we need to make systems self-managing. IBM's autonomic computing, Microsoft's DSI, and Intel's proactive computing are some of the major efforts in this direction.
How are the number of splits, the number of map and reduce tasks, memory allocation to tasks, etc., determined?
15 Job Configuration Parameters
190+ parameters in Hadoop
Set manually or defaults are used
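A sketch of setting a few of these parameters per job, which also answers part of the question on the previous slide (the values, the job name, and the Hadoop 1.x property names are illustrative; anything left unset falls back to the cluster defaults):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// A handful of the 190+ knobs, set per job; unset parameters fall back to
// the defaults in mapred-site.xml / mapred-default.xml.
public class JobTuning {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapred.reduce.tasks", 8);           // number of reduce tasks
        conf.set("mapred.child.java.opts", "-Xmx512m");  // heap given to each task JVM
        Job job = Job.getInstance(conf, "tuned-job");    // job name is illustrative
        // Split size (and hence the number of map tasks) can be bounded explicitly:
        FileInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L); // 128 MB per split
        return job;
    }
}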
17 PIG
One frequent complaint about MapReduce is that it is difficult to program, and the development cycle is very long: as you implement a program in MapReduce, you have to think at the level of mapper and reducer functions and job chaining.
Pig started as a research project within Yahoo! in the summer of 2006 and joined the Apache Incubator in September 2007. It was created at Yahoo! to make it easier for researchers and engineers to mine the huge datasets there.
Pig is a dataflow programming environment for processing very large files; its language is called Pig Latin.
Pig is a Hadoop extension that simplifies Hadoop programming by giving you a high-level data processing language while keeping Hadoop's simple scalability and reliability.
Yahoo! runs 40% of all its Hadoop jobs with Pig; Twitter also uses Pig.
18 PIG: What it looks like
LOAD reads a data file into a relation, with a defined schema; the name on the left-hand side is not a variable but a relation.
19 Word count example in PIG
text = LOAD 'text' USING TextLoader();                          -- loads each line as one column
tokens = FOREACH text GENERATE FLATTEN(TOKENIZE($0)) AS word;
wordcount = FOREACH (GROUP tokens BY word) GENERATE group AS word, COUNT_STAR($1);
(Figure: the Pig job goes through an MR transformation into MapReduce jobs that run against HDFS.)
20 PIG vs. Hive
Pig is a new language, easy to learn if you know languages similar to Perl.
Hive is a subset of SQL with very simple variations to enable MapReduce-like computation. So if you come from a SQL background you will find HiveQL extremely easy to pick up (many of your SQL queries will run as is), while if you come from a procedural programming background (without SQL knowledge) then Pig will be much more suitable for you.
Hive is a bit easier to integrate with other systems and tools since it speaks the language they already speak (i.e., SQL).
Ultimately the choice between Hive and Pig will depend on the exact requirements of the application domain and the preferences of the implementers and of those writing queries.
21 HIVE (HQL)
Hive is a data warehouse infrastructure built on top of Hadoop that can compile SQL queries into MR jobs and run them on a Hadoop cluster.
Invented at Facebook for their own problems.
SQL-like query language (HQL/HiveQL) to retrieve and process the data.
JDBC/ODBC access is provided.
It is also used together with HBase.
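Because JDBC access is provided, HiveQL can be submitted from a Java client; a minimal sketch (host, port, table, and query are made up, and the HiveServer2 JDBC driver is assumed to be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Submits a HiveQL query over JDBC; Hive compiles it into MapReduce jobs
// on the cluster and streams the result rows back to the client.
public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 driver
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT word, COUNT(*) FROM wordcounts GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}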
22 HBase
HBase is not a high-level language that compiles to MapReduce; HBase is about allowing Hadoop to support lookups/transactions on key/value pairs.
HBase allows you to do quick random lookups, rather than scanning all of the data sequentially, and to insert/update/delete in the middle, not just add/append.
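A minimal sketch of such a random lookup and in-place update using the HBase Java client API (the table name, column family, and row key are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Random read and in-place update of a single row by key, no sequential scan needed.
public class HBaseLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Update (or insert) one cell of one row by key.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("user42@example.com"));
            table.put(put);
            // Quick random lookup of the same row by key.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));
        }
    }
}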
23 Sqoop
Loads bulk data into Hadoop from relational databases.
Imports individual tables or entire databases to files in HDFS.
Provides the ability to import from SQL databases straight into your Hive data warehouse.
Importing a table (here, the USERS table of a MySQL database) into HDFS could be done with the command:
sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
  --local --hive-import