Cloudera Certification for Apache Hadoop Admin


1 Cloudera Certification for Apache Hadoop Admin

2 Curriculum
HDFS (17%)
YARN and MapReduce version 2 (MRv2) (17%)
Hadoop Cluster Planning (16%)
Hadoop Cluster Installation and Administration (25%)
Resource Management (10%)
Monitoring and Logging (15%)
Miscellaneous

3 HDFS (17%) Describe the function of HDFS Daemons
Describe the normal operation of an Apache Hadoop cluster, both in data storage and in data processing.
Identify current features of computing systems that motivate a system like Apache Hadoop.
Classify major goals of HDFS design.
Given a scenario, identify an appropriate use case for HDFS Federation.
Identify components and daemons of an HDFS HA-Quorum cluster.
Analyze the role of HDFS security (Kerberos).
Determine the best data serialization choice for a given scenario.
Describe file read and write paths.
Identify the commands to manipulate files in the Hadoop File System Shell.

4 Describe the function of HDFS Daemons
Datanode (Stores data in the form of files) Namenode (In memory representation of HDFS file metadata) Secondary namenode (Helper to Namenode)

5 Hadoop Architecture
(Diagram: cluster node roles - metadata master, storage, processing, helper, and processing master)

6 Hadoop Architecture
(Diagram: HDFS storage and Map Reduce processing layered across the cluster nodes)

7 HDFS
(Diagram: Namenode, Secondary Namenode, and Datanodes, with Map Reduce running alongside HDFS)

8 Typical Hadoop Cluster
(Diagram: racks of nodes running HDFS, connected by network switches)

9 Typical Hadoop Cluster
(Diagram: Namenode (NN), Secondary Namenode (SNN), and many Datanodes (DN), connected by network switches)

10 Describe the normal operation of an Apache Hadoop cluster, both in data storage and in data processing. Hadoop Cluster Data Storage (HDFS) Files and Blocks Fault Tolerance - Replication Factor Metadata Datanode Namenode and Secondary Namenode Heartbeat Checksum Namenode recovery (fsimage, editlogs and safemode) Data Processing (Map Reduce – classic/YARN) Mappers and Reducers MRv1/Classic Job Tracker Task Tracker MRv2/YARN Resource Manager Node Manager

11 Hadoop Cluster (Single node/Cloudera VM)
(Diagram: metadata, storage, processing, helper, and processing master roles all on a single node)

12 Hadoop Cluster (Single node/Cloudera VM)
(Diagram: Namenode, Datanode, and Secondary Namenode daemons all running on one node)
File Name: deckofcards.txt
Block Name: blk_XXX1
Contents: BLACK|SPADE|2 BLACK|SPADE|3 BLACK|SPADE|4 BLACK|SPADE|5 BLACK|SPADE|6 BLACK|SPADE|7 BLACK|SPADE|8 BLACK|SPADE|9 BLACK|SPADE|10

13 Hadoop Cluster (Single node/Cloudera VM)
Namenode metadata:
File Name | Block Name | Location
deckofcards.txt | blk_XXX1 | node01
File Name: deckofcards.txt
Block Name: blk_XXX1
Block size: default (128 MB)
Replication Factor: 3 (but only one copy will exist on a single node)
Contents: BLACK|SPADE|2 BLACK|SPADE|3 BLACK|SPADE|4 BLACK|SPADE|5 BLACK|SPADE|6 BLACK|SPADE|7 BLACK|SPADE|8 BLACK|SPADE|9 BLACK|SPADE|10
Notes: The Namenode keeps the file name, all block names, and block locations in memory. On the Datanode there will be one or more files created with the prefix blk_*. One file is split into multiple blocks. Processing will be covered later.

14 Hadoop Cluster Helper Storage Processing Metadata Storage Processing
Processing Master

15 Hadoop Cluster (Storage)
Namenode metadata:
File Name | Block Name | Location
deckofcards.txt | blk_XXX1 | node01
deckofcards.txt | blk_XXX1 | node02
deckofcards.txt | blk_XXX1 | node03
(Diagram: Namenode, Secondary Namenode, and three Datanodes, each holding a copy of blk_XXX1)
File Name: deckofcards.txt (a few bytes)
Block Name: blk_XXX1
Block size: default (128 MB)
Replication Factor: 3 (now there will be 3 copies of each block)
Contents (sample): BLACK|SPADE|2 BLACK|SPADE|3 BLACK|SPADE|4 BLACK|SPADE|5 BLACK|SPADE|6 BLACK|SPADE|7 BLACK|SPADE|8 BLACK|SPADE|9 BLACK|SPADE|10
Notes: The Namenode keeps the file name, all block names, and block locations in memory. Files with the prefix blk_* are created on the Datanodes, depending on the size of the file. One file is split into multiple blocks. Processing will be covered later.

16 Hadoop Cluster (Storage)
Namenode metadata:
File Name | Block Name | Location
deckofcards.txt | blk_XXX1 | node01
deckofcards.txt | blk_XXX1 | node02
deckofcards.txt | blk_XXX1 | node03
deckofcards.txt | blk_XXX2 | node01
deckofcards.txt | blk_XXX2 | node02
deckofcards.txt | blk_XXX2 | node03
(Diagram: Namenode, Secondary Namenode, and three Datanodes, each holding copies of blk_XXX1 and blk_XXX2)
File Name: deckofcards.txt (200 MB)
Block Name: blk_XXX1 (128 MB), blk_XXX2 (72 MB)
Block size: default (128 MB)
Replication Factor: 3 (now there will be 3 copies of each block)
Contents (sample): BLACK|SPADE|2 BLACK|SPADE|3 BLACK|SPADE|4 BLACK|SPADE|5 BLACK|SPADE|6 BLACK|SPADE|7 BLACK|SPADE|8 BLACK|SPADE|9 BLACK|SPADE|10
Notes: The Namenode keeps the file name, all block names, and block locations in memory. Files with the prefix blk_* are created on the Datanodes, depending on the size of the file. One file is split into multiple blocks. Processing will be covered later.
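To see this file-to-block mapping on a running cluster, commands along the following lines can be used; the local file name and HDFS path are illustrative.
# Copy a local file into HDFS (path is illustrative)
hdfs dfs -put deckofcards.txt /user/cloudera/deckofcards.txt
# Ask the namenode for the file's blocks, replication factor, and block locations
hdfs fsck /user/cloudera/deckofcards.txt -files -blocks -locations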

17 Files and Blocks File abstraction using blocks
File abstraction means a file can be larger than any single hard disk in the cluster. This can be achieved by a network file system as well as by HDFS. HDFS and other distributed file systems typically build on each node's local file system rather than on a network file system. Files are distributed on HDFS in blocks of size dfs.blocksize.
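A quick way to check the effective block size from a client (134217728 bytes = 128 MB); the file and path in the second command are illustrative:
# Print the configured block size in bytes
hdfs getconf -confKey dfs.blocksize
# Optionally store one file with a non-default block size (64 MB here) for testing
hdfs dfs -D dfs.blocksize=67108864 -put deckofcards.txt /user/cloudera/deckofcards_64mb.txt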

18 Fault Tolerance – Replication Factor
Fault tolerance – HDFS is fault tolerant. HDFS does not use RAID (RAID only addresses hard disk failure; mirroring is expensive and striping is slow). HDFS uses block-level replication, and dfs.replication controls how many copies are made (default 3). HDFS replication handles disk failure as well as most other hardware failures (except network failures). Network failures are addressed by using multiple racks with multiple switches.
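The replication factor can be inspected and changed per file from the shell; paths below are illustrative:
# The second column of the listing shows the replication factor of each file
hdfs dfs -ls /user/cloudera
# Change the replication factor of an existing file to 2 and wait until it takes effect
hdfs dfs -setrep -w 2 /user/cloudera/deckofcards.txt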

19 Metadata Files are divided into blocks based on dfs.blocksize (default 128 MB). Each block has multiple copies stored on the servers designated as datanodes; this is controlled by the parameter dfs.replication (default 3). What is file metadata? An HDFS file is logical; each block has a block id and multiple copies, and each copy is stored on a separate data node. The mapping between file, block, and block location is the metadata of a file, along with file permissions, directories, etc. All of this is kept in memory on the Namenode.

20 Data node Actual contents of the files are stored as blocks on the slave nodes Blocks are simply files on the slave nodes’ underlying file system Named blk_xxxxxxx Nothing on the slave node provides information about what underlying file the block is a part of That information is only stored in the NameNode’s metadata Each block is stored on multiple different nodes for redundancy Default is three replicas Each slave node runs a DataNode (DN) daemon Controls access to the blocks Communicates with the NameNode
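A small sketch of what this looks like on a slave node; the data directory below is an assumption, so check dfs.datanode.data.dir for the real location:
# On a datanode, blocks are plain local files named blk_* (plus matching .meta checksum files)
find /dfs/dn -name 'blk_*' | head -5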

21 Data node (Slave) Files (uses replication factor) Blocks Checksum
Processes (Stand Alone) 1) Data Node

22 Data node (Slave) Files (uses replication factor)
1) dfs.datanode.data.dir Processes (Stand Alone) 1) proc_datanode

23 Name node The Namenode is a single point of failure (unless HA is configured)
The NameNode (NN) stores all metadata in memory: information about file locations in HDFS, file ownership and permissions, names of the individual blocks, and locations of the blocks. Metadata is stored on disk and read when the NameNode daemon starts up; the file name is fsimage. Note: block locations are not stored in fsimage. Changes to the metadata are made in RAM and are also written to a log file on disk called edits – full details later.

24 Name node (Master) Namespace (Memory) Files (Must be mirrored)
1) File locations in HDFS 2) File ownership 3) File permissions 4) Names of the individual blocks 5) Locations of the blocks Files (Must be mirrored) FS Image Edit Logs Processes (Stand Alone) 1) Name Node (proc_namenode)

25 Name node (Master) Namespace (Memory) Files (Must be mirrored)
1) dfs.namenode.name.dir Processes (Stand Alone) 1) proc_namenode

26 Name node (Master) Configuration file for name node hdfs-site.xml (typically located at /etc/hadoop/conf) dfs.namenode.name.dir parameter in hdfs-site.xml determines location of the edit logs and fs image proc_namenode is name of the process
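To confirm where this metadata lives on disk, something like the following can be used; the directory and fsimage file name are assumptions taken from the getconf output:
# Where the namenode keeps its fsimage and edit logs
hdfs getconf -confKey dfs.namenode.name.dir
# List the metadata files (directory is an example based on the output above)
ls -l /dfs/nn/current
# Optionally dump an fsimage to XML for inspection (fsimage file name is illustrative)
hdfs oiv -p XML -i /dfs/nn/current/fsimage_0000000000000001234 -o /tmp/fsimage.xml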

27 Secondary Name node (Helper)
Namespace (Memory) Files Edit logs FS Image Processes (Stand Alone) 1) proc_secondarynamenode

28 Secondary Name node (Helper)
Configuration file: hdfs-site.xml (typically located at /etc/hadoop/conf). The dfs.namenode.checkpoint.* parameters in the name node's hdfs-site.xml determine the interaction between the name node and the secondary name node. proc_secondarynamenode is the name of the process.

29 Secondary name node (Helper)
The Secondary NameNode (2NN) is not a failover NameNode! It performs memory-intensive administrative functions for the NameNode. The NameNode keeps information about files and blocks (the metadata) in memory and writes metadata changes to an edit log. The Secondary NameNode periodically combines a prior filesystem snapshot and the edit log into a new snapshot, and the new snapshot is transmitted back to the NameNode. Note that the fsimage does not contain the locations of the blocks; the Namenode namespace is built in memory during safe mode as data nodes report in to the cluster. The Secondary NameNode should run on a separate machine in a large installation, and it requires as much RAM as the NameNode.

30 Determine how HDFS stores, reads, and writes files.

31 Heartbeat and block report
The Datanode sends a heartbeat every 3 seconds to the Namenode; the heartbeat interval is controlled by dfs.heartbeat.interval. Along with the heartbeat, the Datanode sends information such as disk capacity and current activity. The Datanode also sends a periodic block report (default 6 hours) to the Namenode (dfs.blockreport.*).
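The effect of heartbeats is visible from the admin CLI; this is a standard command and needs HDFS superuser privileges:
# Per-datanode capacity, usage, and last contact time as seen by the namenode
hdfs dfsadmin -report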

32 Checksum Checksum is used to ensure blocks or files are not corrupted while files are being read from HDFS or written to HDFS

33 Namenode Recovery and Secondary Namenode
Edit logs and FSImage: the FSImage contains only files and blocks (to reduce the size of the FSImage and improve restore time, which is serial in nature); it does not contain block locations. Secondary Namenode: a helper process which merges the latest edit log with the last snapshot of the FSImage and creates a new one. Recovery process: the Namenode starts in safemode, restores the latest FSImage, recovers using the latest edit log, and then takes a roll call of the datanodes to determine the locations of the blocks.
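Safemode can be inspected and controlled with standard admin commands (leaving safemode manually is normally only needed during recovery):
# Check, enter, and leave safemode; the namenode enters it automatically during startup
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave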

34 HDFS - Important parameters (Hadoop cluster with one name node)
File Name | Parameter Name | Parameter value | Description
core-site.xml | fs.defaultFS / fs.default.name | hdfs://<namenode_ip>:8020 | Namenode IP address or nameservice (HA config)
hdfs-site.xml | dfs.blocksize / dfs.block.size | 128 MB | Block size at which files will be stored physically
hdfs-site.xml | dfs.replication | 3 | Number of copies per block of a file for fault tolerance
hdfs-site.xml | dfs.namenode.http-address | :50070 | Namenode Web UI; by default it might use the IP address of the namenode
hdfs-site.xml | dfs.datanode.http.address | :50075 | Datanode Web UI
hdfs-site.xml | dfs.namenode.name.dir / dfs.name.dir | <directory_location> | Directory location for the FS image and edit logs on the name node
hdfs-site.xml | dfs.datanode.data.dir / dfs.data.dir | <directory_location> | Directory location for storing blocks on data nodes
hdfs-site.xml | dfs.namenode.checkpoint.dir / fs.checkpoint.dir | <directory_location> | Directory used by the secondary namenode for checkpointing
hdfs-site.xml | dfs.namenode.checkpoint.period / fs.checkpoint.period | 1 hour | Checkpoint interval (merging edit logs with the current FS image to create a new FS image)
hdfs-site.xml | dfs.namenode.checkpoint.txns | | Number of transactions after which a checkpoint is triggered

35 Data processing * Will be covered later
Describe the normal operation of an Apache Hadoop cluster, both in data storage and in data processing. Data processing Distributed Scalable Data locality Fault tolerant * Will be covered later

36 Mappers and Reducers MRv1/Classic MRv2/YARN
Describe the normal operation of an Apache Hadoop cluster, both in data storage and in data processing. Mappers and Reducers MRv1/Classic MRv2/YARN

37 Hadoop Cluster (Processing)
Mappers: each map task operates on one block (typical) or more (depending on split size), and the framework typically tries to run the map task on the node where the block is stored. Shuffle & Sort: happens after all mappers are done and before the reduce phase starts; sorts and consolidates all the intermediate data for the reducers. Reducers: operate on the shuffled/sorted map output and (typically) write the output back to HDFS.

38 Hadoop Cluster (Processing)
There are two frameworks that translate a job into tasks to process the data:
MRv1 "Classic": Job Tracker (permanent – per cluster), Task Tracker (permanent – per node), predetermined number of map and reduce slots.
MRv2/YARN: Resource Manager (permanent – per cluster), Node Manager (permanent – per node), Application Master (transient – per job), Containers (transient – per job per node).
There is a separate item which covers MRv1 and MRv2 in detail; for now, just understand that there are two frameworks to process data, along with their daemon processes.

39 Mappers and Reducers Mappers
Mappers: the number of mappers is determined by the framework based on block size and split size; uses data locality. Logic to filter and perform row-level transformations is implemented in the map function; mapper tasks execute the map function.
Shuffle & Sort: typically handled by the Hadoop MapReduce framework; the capability can be enhanced or customized in the form of custom partitioners and custom comparators.
Reducers: developers need to determine the number of reducers, which can be pre-determined in some cases. If the report has to be generated by year, the number of reducers can be the number of years for which you want to generate the report; if the report has to be generated for a number of regions or states, the number of reducers can be the number of regions or states. Logic to implement aggregations, joins, etc. is implemented in the reduce function; reducer tasks execute the reduce function.

40 Identify current features of computing systems that motivate a system like Apache Hadoop.
RDBMS (Relational Database Management Systems) Designed and developed for operational and transactional applications Not efficient for batch processing Not linearly scalable Grid Computing (In-memory) MPP (Massively Parallel Processing)

41 RDBMS
 | Traditional RDBMS | Hadoop
Data size | Gigabytes | Petabytes
Access | Interactive and batch (small) | Batch (large)
Updates | Read and write many times | Write once, read many times
Structure | Static schema | Dynamic schema
Integrity | High | Low
Scaling | Nonlinear | Linear

42 Apache Hadoop Distributed File System Distributed Processing
Data Locality Scalable Supports Structured, Unstructured and Semi-structured data Cost effective Open source Proven on commodity hardware

43 Classify major goals of HDFS Design
Distributed – using blocks, default block size 128 MB
Hardware Failure – detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS; addressed using the replication factor, default 3
Streaming Data Access – applications that run on HDFS need streaming access to their data sets; they are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by users; the emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications targeted for HDFS, so POSIX semantics in a few key areas have been traded to increase data throughput rates
Large Data Sets – tuned for large data sets
Simple Coherency Model – write-once-read-many; HDFS files are immutable
Data Locality – moving computation to data
Portability Across Heterogeneous Hardware and Software Platforms – logical file system

44 Given a scenario, identify appropriate use case for HDFS Federation
HDFS (two main layers) Namespace manages directories, files and blocks. It supports file system operations such as creation, modification, deletion and listing of files and directories. Block Storage Block Management maintains the membership of datanodes in the cluster. It supports block-related operations such as creation, deletion, modification and getting location of the blocks. It also takes care of replica placement and replication. Physical Storage stores the blocks and provides read/write access to it.

45 Given a scenario, identify appropriate use case for HDFS Federation

46 Given a scenario, identify appropriate use case for HDFS Federation
HDFS (Limitations) Namespace Scalability Performance Isolation

47 Given a scenario, identify appropriate use case for HDFS Federation
HDFS Federation (Namenode) Namenode Scalability Better Performance Isolation HDFS Federation (implementation) Multiple namespaces Multiple namenodes Same set of datanodes for all namespaces Block Pool Namespace Volume (Block Pool and associated Namespace) Self contained

48 Given a scenario, identify appropriate use case for HDFS Federation

49 Identify components and daemons of an HDFS HA-Quorum cluster
Namenode recovery and secondary namenode. Edit logs and FSImage: the FSImage contains only files and blocks (to reduce its size and improve restore time, which is serial in nature); it does not contain block locations. Edit logs are merged into the FSImage at regular intervals (checkpointing). Secondary Namenode: a helper process which merges the latest edit log with the last snapshot of the FSImage and creates a new one. Recovery process: the Namenode starts in safemode, restores the latest FSImage, recovers using the latest edit log, and then takes a roll call of the datanodes to determine the locations of the blocks.

50 Identify components and daemons of an HDFS HA-Quorum cluster
Namenode recovery and secondary namenode (limitations): checkpointing is resource intensive; if the IP address changes, failover might not be transparent; recovery is time consuming.

51 Identify components and daemons of an HDFS HA-Quorum cluster
HDFS HA – Quorum cluster components Active (one) and Standby (one) Namenodes Journal Nodes (Journal directories – at least 3 or more in odd number) Zookeeper (quorum) HDFS HA – Quorum cluster scenarios High Availability Transparent Failover

52 Identify components and daemons of an HDFS HA-Quorum cluster
HDFS HA – Quorum cluster components:
Active and Standby Namenodes – HA is different from the Secondary namenode or Federation. Only one namenode is active at a time; the standby node receives edit logs at regular intervals from the journal nodes (the journal nodes get edit logs from the active namenode).
Shared edits / shared storage – uses NFS to store edit logs in a shared location used by both Namenodes (active and standby); the active Namenode writes to the shared edit logs, and the standby Namenode reads from them and applies the changes.
Journal Nodes (journal directories) – typically 3 (when more than 3, it needs to be an odd number). The active namenode writes edit logs to a majority of the configured journal nodes; the standby namenode reads edit logs from any of the surviving journal nodes.
Zookeeper (quorum) – typically runs on 3 or 5 nodes; the failover controller process (proc_zkfc) is lightweight and can be deployed on both Namenodes and the ResourceManager.

53 Identify components and daemons of an HDFS HA-Quorum cluster

54 Analyze the role of HDFS security (Kerberos)

55 Determine the best data serialization choice for a given scenario
Writable. Avro – typically used when other languages need to store and read data in HDFS. Java Serialization – not used, as it is heavyweight compared to Writable and Avro (it is not compact, fast, extensible, and interoperable – we will see these characteristics later).

56 Serialization and Deserialization
Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the reverse process of turning a byte stream back into a series of structured objects. In the context of Hadoop, Serialization is used for inter-process communication (between mappers and reducers) as well as while storing data persistently. RPC (Remote Procedure Calls) is used for inter-process communication. The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message. Serialization for RPC should be Compact, Fast, Extensible and Interoperable

57 Serialization in Hadoop
Writable Interface: the Writable interface defines two methods, one for writing its state to a DataOutput binary stream (write) and one for reading its state from a DataInput binary stream (readFields). There are a number of classes in the Hadoop API which implement the Writable interface, such as IntWritable, Text, etc.

58 Writable types

59 Serialization Frameworks (eg: avro)
It is not mandatory to implement or use Writable. Hadoop has an API for pluggable serialization frameworks (package org.apache.hadoop.io.serializer), which includes the class WritableSerialization implementing Serialization for Writable types. The parameter for customizing serialization is io.serializations. Cloudera sets this value to both Writable and Avro serialization, which means that both Hadoop Writable objects and Avro objects can be serialized and deserialized out of the box.

60 Avro primitive types

61 Describe file read and write paths

62 Determine how HDFS stores, reads, and writes files.
If a file f1 of 400 MB has to be stored on a cluster with a block size of 128 MB, the file will be divided into 4 blocks (3 x 128 MB and 1 x 16 MB). HDFS permits reading a file that is being written. HDFS uses checksums to validate both reads and writes at the block level; checksums are stored along with the blocks. HDFS logs the verification details persistently, which assists in identifying bad disks.

63 Checksum Checksum files are used for data integrity.
Whenever a block of a file is written, a checksum file is generated. When a client reads a block, HDFS passes the pre-computed checksum to the client to ensure data integrity. HDFS logs the verification details persistently, which assists in identifying bad disks.
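The stored checksum can also be requested explicitly from the shell; the path is illustrative:
# Print the HDFS-computed checksum of a file
hdfs dfs -checksum /user/cloudera/deckofcards.txt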

64 Describe file read and write paths
Anatomy of file read Anatomy of file write

65 Anatomy of file read

66 Anatomy of file read The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1) in Figure 3-2). DistributedFileSystem calls the namenode, using RPC, to determine the locations of the blocks for the first few blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client (according to the topology of the cluster’s network; see Network Topology and Hadoop). If the client is itself a datanode (in the case of a MapReduce task, for instance), the client will read from the local datanode if that datanode hosts a copy of the block. The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.

67 Anatomy of file read The client then calls read() on the stream (step 3). DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream (step 4). When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block (step 5). This happens transparently to the client, which from its point of view is just reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream (step 6).

68 Anatomy of file read During reading, if the DFSInputStream encounters an error while communicating with a datanode, it will try the next closest one for that block. It will also remember datanodes that have failed so that it doesn’t needlessly retry them for later blocks. The DFSInputStream also verifies checksums for the data transferred to it from the datanode. If a corrupted block is found, it is reported to the namenode before the DFSInputStream attempts to read a replica of the block from another datanode. One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block. This design allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across all the datanodes in the cluster. Meanwhile, the namenode merely has to service block location requests (which it stores in memory, making them very efficient) and does not, for example, serve data, which would quickly become a bottleneck as the number of clients grew.

69 Anatomy of file write

70 Anatomy of file write The client creates the file by calling create() on DistributedFileSystem (step 1). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it (step 2). The namenode performs checks such as permissions and whether the file already exists. As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline (step 4).

71 Anatomy of file write DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline (step 5). If a datanode fails while data is being written to it, then the following actions are taken, which are transparent to the client writing the data. First, the pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets. The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline, and the remainder of the block’s data is written to the two good datanodes in the pipeline. The namenode notices that the block is under-replicated, and it arranges for a further replica to be created on another node. Subsequent blocks are then treated as normal.

72 Anatomy of file write It’s possible, but unlikely, that multiple datanodes fail while a block is being written. As long as dfs.replication.min replicas (which default to one) are written, the write will succeed, and the block will be asynchronously replicated across the cluster until its target replication factor is reached (dfs.replication, which defaults to three). When the client has finished writing data, it calls close() on the stream (step 6). This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete (step 7). The namenode already knows which blocks the file is made up of (via DataStreamer asking for block allocations), so it only has to wait for blocks to be minimally replicated before returning successfully.

73 Identify the commands to manipulate files in the Hadoop File System Shell
hadoop fs (used to manage user spaces, directories, and files) hadoop jar (used to submit map reduce jobs) hdfs fsck (used to check file system health, for administration of the cluster)
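A few representative commands for each; paths are illustrative and the examples jar location is an assumption that varies by distribution:
# File System Shell basics
hadoop fs -mkdir -p /user/cloudera/cards
hadoop fs -put deckofcards.txt /user/cloudera/cards/
hadoop fs -ls /user/cloudera/cards
hadoop fs -get /user/cloudera/cards/deckofcards.txt /tmp/
hadoop fs -rm -r /user/cloudera/cards
# Submit a MapReduce job packaged in a jar (jar path is an assumption)
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/cloudera/in /user/cloudera/out
# Check overall file system health
hdfs fsck /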

74 HDFS - Important parameters (Hadoop cluster with one name node)
File Name | Parameter Name | Parameter value | Description
core-site.xml | fs.defaultFS / fs.default.name | hdfs://<namenode_ip>:8020 | Namenode IP address or nameservice (HA config)
hdfs-site.xml | dfs.blocksize / dfs.block.size | 128 MB | Block size at which files will be stored physically
hdfs-site.xml | dfs.replication | 3 | Number of copies per block of a file for fault tolerance
hdfs-site.xml | dfs.namenode.http-address | :50070 | Namenode Web UI; by default it might use the IP address of the namenode
hdfs-site.xml | dfs.datanode.http.address | :50075 | Datanode Web UI
hdfs-site.xml | dfs.namenode.name.dir / dfs.name.dir | <directory_location> | Directory location for the FS image and edit logs on the name node
hdfs-site.xml | dfs.datanode.data.dir / dfs.data.dir | <directory_location> | Directory location for storing blocks on data nodes
hdfs-site.xml | dfs.namenode.checkpoint.dir / fs.checkpoint.dir | <directory_location> | Directory used by the secondary namenode for checkpointing
hdfs-site.xml | dfs.namenode.checkpoint.period / fs.checkpoint.period | 1 hour | Checkpoint interval (merging edit logs with the current FS image to create a new FS image)
hdfs-site.xml | dfs.namenode.checkpoint.txns | | Number of transactions after which a checkpoint is triggered

75 Exercise Understand daemon processes (Namenode, Secondary Namenode, Datanode) Commands to stop and start HDFS daemons Copying data back and forth to HDFS Understand parameter files and data files Restore and recovery of Namenode Important parameters and their defaults (dfs.blocksize, dfs.replication) Namenode Web UI
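For the "commands to stop and start HDFS daemons" item, a sketch for a CDH5 package-based install; the service names assume the stock CDH packages, and clusters managed by Cloudera Manager start and stop roles from the CM UI instead:
sudo service hadoop-hdfs-namenode restart
sudo service hadoop-hdfs-secondarynamenode restart
sudo service hadoop-hdfs-datanode restart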

76 Interview questions What are different Hadoop, HDFS and Map Reduce daemons? How data can be copied in and out of HDFS? What is Namenode Web UI and what is default port number? How do you restore and recover namenode?

77 YARN and MapReduce version 2 (MRv2) (17%)
Understand how upgrading a cluster from Hadoop 1 to Hadoop 2 affects cluster settings Understand how to deploy MapReduce v2 (MRv2 / YARN), including all YARN daemons Understand basic design strategy for MapReduce v2 (MRv2) Determine how YARN handles resource allocations Identify the workflow of MapReduce job running on YARN Determine which files you must change and how in order to migrate a cluster from MapReduce version 1 (MRv1) to MapReduce version 2 (MRv2) running on YARN.

78 Hadoop Cluster (Processing)
Mappers and Reducers are the tasks that process data. There are two frameworks that translate a job into tasks to process the data:
MRv1 "Classic" (not covered in detail): Job Tracker (permanent – per cluster), Task Tracker (permanent – per node), predetermined number of map and reduce slots.
MRv2/YARN: Resource Manager (permanent – per cluster), Node Manager (permanent – per node), Application Master (transient – per job), Containers (transient – per job per node).
Here we will cover YARN in more detail, as it is the path forward and the default starting from Hadoop 2.x.

79 Understand how upgrading a cluster from Hadoop 1 to Hadoop 2 affects cluster settings
Component | Hadoop 1 | Hadoop 2
HDFS | Single Namenode | HA and Federation
Map Reduce Job Management | MRv1 | YARN

80 Understand how upgrading a cluster from Hadoop 1 to Hadoop 2 affects cluster settings
Hadoop 1 – By default uses MRv1/Classic for job management Parameter files – mapred-site.xml Daemon Processes (classic) – Job Tracker, Task Tracker Hadoop 2 – By default uses MRv2/YARN for job management Parameter files – mapred-site.xml and yarn-site.xml Daemon Processes (YARN) – Resource Manager, Node Manager

81 Understand how to deploy MapReduce v2 (MRv2 / YARN), including all YARN daemons
Resource Manager (typically 1, but can configure HA) Node Manager App timeline server Job history server

82 Understand how to deploy MapReduce v2 (MRv2 / YARN), including all YARN daemons
Parameter files mapred-site.xml yarn-site.xml Important parameters for YARN Starting YARN daemons Using Cloudera Manager Using command line

83 Important parameters in MRv2/YARN
File Name | Parameter Name | Parameter value | Description
yarn-site.xml | yarn.resourcemanager.address | <ip_address>:<port> | Resource Manager IP and port
yarn-site.xml | yarn.resourcemanager.webapp.address | <ip_address>:<port> | Resource Manager web UI IP and port
yarn-site.xml | yarn.scheduler.minimum-allocation-mb | 1024 | Minimum memory allocation per container
yarn-site.xml | yarn.scheduler.maximum-allocation-mb | 4096 | Maximum memory allocation per container
yarn-site.xml | yarn.scheduler.minimum-allocation-vcores | 1 | Minimum number of virtual cores per container
yarn-site.xml | yarn.scheduler.maximum-allocation-vcores | 4 | Maximum number of virtual cores per container
yarn-site.xml | yarn.resourcemanager.scheduler.class | | Class which determines the scheduler – Fair or Capacity

84 Important parameters in MRv2/YARN
File Name | Parameter Name | Parameter value | Description
mapred-site.xml | mapreduce.framework.name | yarn |
mapred-site.xml | mapreduce.jobhistory.webapp.address | <ip_address>:<port> | Job history server Web UI IP address and port number
mapred-site.xml | yarn.app.mapreduce.am.* | | Parameters related to the application master
mapred-site.xml | mapreduce.map.java.opts | | JVM heap size for the child task of a map container
mapred-site.xml | mapreduce.reduce.java.opts | | JVM heap size for the child task of a reduce container
mapred-site.xml | mapreduce.map.memory.mb | | Size of the container for a map task
mapred-site.xml | mapreduce.map.cpu.vcores | 1 | Number of virtual cores required to run each map task
mapred-site.xml | mapreduce.reduce.memory.mb | | Size of the container for a reduce task
mapred-site.xml | mapreduce.reduce.cpu.vcores | | Number of virtual cores required to run each reduce task

85 Understand basic design strategy for MapReduce v2 (MRv2)
Hadoop 1.0: MapReduce (cluster resource management & data processing), HDFS (distributed, redundant and reliable storage).
Hadoop 2.0: YARN (cluster resource management), MapReduce (data processing), Others (non map reduce based data processing), HDFS2 (distributed, redundant and reliable storage with a highly available namenode).

86 Determine how YARN handles resource allocations
Question: How does YARN handle resource allocations? Answer: using the Resource Manager, Node Manager, and a per-job Application Master (unlike the job tracker and task tracker in MRv1/classic). We need to define several parameters for resource allocation (CPU/cores and memory): yarn-site.xml has parameters at the node level, and mapred-site.xml has parameters at the task level.
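The resulting allocations can be observed with standard YARN commands:
# Node Managers registered with the Resource Manager, with memory and vcore usage
yarn node -list -all
# Applications currently known to the Resource Manager
yarn application -list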

87 Important parameters in MRv2/YARN
File Name | Parameter Name | Parameter value | Description
yarn-site.xml | yarn.resourcemanager.address | <ip_address>:<port> | Resource Manager IP and port
yarn-site.xml | yarn.resourcemanager.webapp.address | <ip_address>:<port> | Resource Manager web UI IP and port
yarn-site.xml | yarn.nodemanager.resource.memory-mb | 8096 | Total memory available for containers on each node
yarn-site.xml | yarn.nodemanager.resource.cpu-vcores | 4 | Total virtual cores available for containers on each node
yarn-site.xml | yarn.scheduler.minimum-allocation-mb | 1024 | Minimum memory allocation per container
yarn-site.xml | yarn.scheduler.maximum-allocation-mb | 4096 | Maximum memory allocation per container
yarn-site.xml | yarn.scheduler.minimum-allocation-vcores | 1 | Minimum number of virtual cores per container
yarn-site.xml | yarn.scheduler.maximum-allocation-vcores | | Maximum number of virtual cores per container
yarn-site.xml | yarn.resourcemanager.scheduler.class | | Class which determines the scheduler – Fair or Capacity

88 Important parameters in MRv2/YARN
File Name | Parameter Name | Parameter value | Description
mapred-site.xml | mapreduce.framework.name | yarn |
mapred-site.xml | mapreduce.jobhistory.webapp.address | <ip_address>:<port> | Job history server Web UI IP address and port number
mapred-site.xml | yarn.app.mapreduce.am.* | | Parameters related to the application master
mapred-site.xml | mapreduce.map.java.opts | 0.8 * mapreduce.map.memory.mb | JVM heap size for the child task of a map container
mapred-site.xml | mapreduce.reduce.java.opts | 0.8 * mapreduce.reduce.memory.mb | JVM heap size for the child task of a reduce container
mapred-site.xml | mapreduce.map.memory.mb | | Size of the container for a map task
mapred-site.xml | mapreduce.map.cpu.vcores | 1 | Number of virtual cores required to run each map task
mapred-site.xml | mapreduce.reduce.memory.mb | | Size of the container for a reduce task
mapred-site.xml | mapreduce.reduce.cpu.vcores | | Number of virtual cores required to run each reduce task

89 Hadoop Cluster – Processing (MRv1)
(Diagram: Job Tracker coordinating Task Trackers, each running map and reduce tasks)

90 Hadoop Cluster – Processing (MRv2/YARN)
(Diagram: Resource Manager coordinating Node Managers; each node runs an App Master and containers for map and reduce tasks)

91 Hadoop Cluster – Processing (MRv2/YARN)

92 Resource Manager It manages nodes by tracking heartbeats from NodeManagers It manages containers Handles application master requests for resources (like providing inputs for creation of containers) De-allocates expired or completed containers It manages per job application masters Creates containers for application masters and also tracks their heartbeats It also manages security (if Kerberos is enabled)

93 Node Manager Communicates with Resource Manager. It sends information about node resources, heartbeats, container status etc. Manages processes in containers Launches Application Masters on request from Resource Manager Launches containers (mappers/reducers) on request from Application Master Monitors resource usage by containers(mappers/reducers) Provides logging services to applications. It aggregates logs for an application and saves those logs to HDFS. Runs auxiliary services Maintains node level security (ACLs)

94 Application Master An Application Master is created per job
It requests containers from the Resource Manager and keeps track of the progress of the job.

95 Identify the workflow of MapReduce job running on YARN

96 Determine which files you must change and how in order to migrate a cluster from MapReduce version 1 (MRv1) to MapReduce version 2 (MRv2) running on YARN. MRv1: mapred-site.xml. MRv2: mapred-site.xml and yarn-site.xml. MRv1 to MRv2: set the framework to yarn in mapred-site.xml; mapred-site.xml should not contain any parameters that belong in yarn-site.xml; define resource manager, node manager, and other YARN-related parameters in yarn-site.xml; define core mapper and reducer related parameters in mapred-site.xml; the job history server needs to be defined in mapred-site.xml to aggregate logs to the job history server. A minimal sketch of the two files follows.
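This sketch covers only the most essential properties, written as shell heredocs; host names, the history server port, and file locations are placeholders, and a real migration needs many more settings:
cat > mapred-site.xml <<'EOF'
<configuration>
  <!-- Switch job submission from classic MRv1 to YARN -->
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
  <!-- Job history server so completed-job logs can be aggregated and browsed -->
  <property><name>mapreduce.jobhistory.address</name><value>historyhost:10020</value></property>
</configuration>
EOF

cat > yarn-site.xml <<'EOF'
<configuration>
  <!-- Where clients and Node Managers find the Resource Manager -->
  <property><name>yarn.resourcemanager.hostname</name><value>rmhost</value></property>
  <!-- Shuffle auxiliary service required by MapReduce on YARN -->
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
</configuration>
EOF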

97 Important Parameters in MRv1/Classic
File Name | Parameter Name | Parameter value | Description
mapred-site.xml | mapred.job.tracker | <ip_address>:8021 | Job Tracker IP address and port number
mapred-site.xml | mapred.job.tracker.http.address | <ip_address>:50030 | Job Tracker web UI IP address and port number
mapred-site.xml | mapred.system.dir | | HDFS directory to store Map Reduce control files
mapred-site.xml | mapred.local.dir | | Local directory to store intermediate data files (map output)
mapred-site.xml | mapred.jobtracker.taskScheduler | | Default is FIFO – Fair and Capacity are the viable options for production deployments
mapred-site.xml | mapred.queue.names | default | Can provide multiple queue names to set priorities while submitting jobs
mapred-site.xml | mapred.tasktracker.map.tasks.maximum | | Maximum map slots per task tracker
mapred-site.xml | mapred.tasktracker.reduce.tasks.maximum | | Maximum reduce slots per task tracker
mapred-site.xml | mapred.reduce.tasks | | Reduce tasks per job

98 Important parameters in MRv2/YARN
File Name | Parameter Name | Parameter value | Description
yarn-site.xml | yarn.resourcemanager.address | <ip_address>:<port> | Resource Manager IP and port
yarn-site.xml | yarn.resourcemanager.webapp.address | <ip_address>:<port> | Resource Manager web UI IP and port
yarn-site.xml | yarn.nodemanager.resource.memory-mb | 8096 | Memory allocated for containers on each node manager
yarn-site.xml | yarn.nodemanager.resource.cpu-vcores | 4 | Vcores allocated for containers on each node manager
yarn-site.xml | yarn.scheduler.minimum-allocation-mb | 1024 | Minimum memory allocation per container
yarn-site.xml | yarn.scheduler.maximum-allocation-mb | 4096 | Maximum memory allocation per container
yarn-site.xml | yarn.scheduler.minimum-allocation-vcores | 1 | Minimum number of virtual cores per container
yarn-site.xml | yarn.scheduler.maximum-allocation-vcores | | Maximum number of virtual cores per container
yarn-site.xml | yarn.resourcemanager.scheduler.class | | Class which determines the scheduler – Fair or Capacity

99 Important parameters in MRv2/YARN
File Name | Parameter Name | Parameter value | Description
mapred-site.xml | mapreduce.framework.name | yarn |
mapred-site.xml | mapreduce.jobhistory.webapp.address | <ip_address>:<port> | Job history server Web UI IP address and port number
mapred-site.xml | yarn.app.mapreduce.am.* | | Parameters related to the application master
mapred-site.xml | mapreduce.map.java.opts | | JVM heap size for the child task of a map container
mapred-site.xml | mapreduce.reduce.java.opts | | JVM heap size for the child task of a reduce container
mapred-site.xml | mapreduce.map.memory.mb | | Size of the container for a map task
mapred-site.xml | mapreduce.map.cpu.vcores | 1 | Number of virtual cores required to run each map task
mapred-site.xml | mapreduce.reduce.memory.mb | | Size of the container for a reduce task
mapred-site.xml | mapreduce.reduce.cpu.vcores | | Number of virtual cores required to run each reduce task

100 Hadoop Cluster Planning (16%)
Principal points to consider in choosing the hardware and operating systems to host an Apache Hadoop cluster. Analyze the choices in selecting an OS Understand kernel tuning and disk swapping Given a scenario and workload pattern, identify a hardware configuration appropriate to the scenario Given a scenario, determine the ecosystem components your cluster needs to run in order to fulfill the SLA Cluster sizing: given a scenario and frequency of execution, identify the specifics for the workload, including CPU, memory, storage, disk I/O Disk Sizing and Configuration, including JBOD versus RAID, SANs, virtualization, and disk sizing requirements in a cluster Network Topologies: understand network usage in Hadoop (for both HDFS and MapReduce) and propose or identify key network design components for a given scenario

101 Typical Hadoop Cluster
(Diagram: racks of slave nodes each running HDFS and YARN daemons, connected by network switches)

102 Typical Hadoop Cluster
(Diagram: Namenode (NN), Secondary Namenode (SNN), and Resource Manager (RM) as masters; many slave nodes each running a Datanode (DN) and Node Manager (NM), connected by network switches)

103 Principal points to consider in choosing the hardware and operating systems to host an Apache Hadoop cluster. Hardware (Hadoop 2.x) Different hardware for gateway/client nodes, master nodes and slave nodes Slave nodes will have both Datanodes and Nodemanagers Master nodes will have Namenode and Resourcemanager on different nodes More than 1 node for masters in production Typical Configuration: One for Namenode, one for secondary namenode and one for resourcemanager HA configuration: One for Namenode, one for standby namenode and one or more Resourcemanagers Federation configuration: More than one namenode, secondary or standby for each namenode and one or more resourcemanagers

104 Principal points to consider in choosing the hardware and operating systems to host an Apache Hadoop cluster. Slave configuration: 4x1TB or 4x2TB hard drives (just a bunch of disks) without RAID configuration; at least 2 quad-core CPUs; 24 to 32 GB RAM; Gigabit Ethernet. Multiples of 1 hard drive, 2 cores, and 6-8 GB RAM work well for I/O-bound applications. Buy as many nodes as the budget allows while choosing components based on performance: the more nodes, the better the performance.

105 Principal points to consider in choosing the hardware and operating systems to host an Apache Hadoop cluster. Slave configuration: quad-core or hex-core CPUs; enable hyper-threading; slave nodes are typically not CPU bound. Containers need to be configured for processing the data (YARN); each container can take up to 2 GB of RAM for map and reduce tasks; slaves should not use virtual memory. Consider other Hadoop ecosystem tools such as HBase, Impala, etc. while configuring YARN. More spindles are better, and more hard disks might be better; 3.5 inch disks are better than 2.5 inch disks; 7,200 RPM disks should be fine compared to SSDs and 15,000 RPM disks; 24 TB is a reasonable maximum on each of the slave nodes. Do not use virtualization. Blade servers are not recommended.

106 Principal points to consider in choosing the hardware and operating systems to host an Apache Hadoop cluster. Master configuration: spend more money on master nodes compared to slaves; carrier-class hardware (instead of commodity hardware), unlike slaves; dual power supplies; dual Ethernet cards; RAID configuration for the hard drives which store the FS image and edit logs; more memory is better (depends on how much data is stored in the cluster). *Network is covered later

107 Principal points to consider in choosing the hardware and operating systems to host an Apache Hadoop cluster. Typically Linux based systems are used Disable SELinux Increase nofile ulimit for hadoop users such as mapred and hdfs to at least 32k Disable IPv6 Install and configure ntp daemon – to synchronize time

108 Analyze the choices in selecting an OS
Operating systems: CentOS (slaves) and RHEL (masters); Fedora Core (typically used for individual workstations, but can also be used); Ubuntu (Debian based); SUSE (popular in Europe); Solaris (not popular in production clusters).

109 Understand kernel tuning and disk swapping
Kernel tuning – important when deploying any server or database; settings live in /etc/sysctl.conf. Set vm.swappiness to a low value. vm.overcommit_memory needs to be enabled for Hadoop streaming jobs. Disable IPv6. Increase ulimit parameters for the users who own the hadoop daemons. Mount data disks with noatime (access time need not be updated for blocks that are stored as physical files). TCP tuning. And many more. Disk swapping: make sure memory is configured properly to reduce swapping between main memory and virtual memory.
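A hedged sketch of the common OS-level settings; the values are typical recommendations rather than mandates, and user names depend on the distribution:
# Discourage swapping and persist the setting across reboots
sysctl -w vm.swappiness=1
echo 'vm.swappiness = 1' >> /etc/sysctl.conf
# Raise the open-file limit for the users that own the hadoop daemons
echo 'hdfs   - nofile 32768' >> /etc/security/limits.conf
echo 'mapred - nofile 32768' >> /etc/security/limits.conf
# Mount data disks with noatime in /etc/fstab, for example:
# /dev/sdb1  /data/1  ext4  defaults,noatime  0 0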

110 Given a scenario and workload pattern, identify a hardware configuration appropriate to the scenario
You have to understand the scenario and then come up with hardware configuration

111 Given a scenario, determine the ecosystem components your cluster needs to run in order to fulfill the SLA Hadoop (HDFS and YARN) – Hadoop core components Hive – Logical database which can define tables/structure on data in HDFS and queried using SQL type syntax Pig – Data flow language which can process structured, semi-structured as well as unstructured data Sqoop – Import and export tool to copy data from relational databases to HDFS and vice versa Impala – Ad hoc querying Oozie – workflow tool Spark – in memory processing Flume – to get data from weblogs into HDFS etc

112 Cluster sizing: given a scenario and frequency of execution, identify the specifics for the workload, including CPU, memory, storage, disk I/O Workload – Identify the workload on the cluster including all the applications that are part of Hadoop eco system as well as complement applications. CPU – Need to count the number of cores configured in the cluster Memory – Total memory in the cluster Storage – Amount of data that can be stored in the cluster Disk I/O – Amount of read and write operations in the cluster Cloudera displays all this information as charts in cloudera manager home page

113 Disk Sizing and Configuration, including JBOD versus RAID, SANs, virtualization, and disk sizing requirements in a cluster. JBOD – Just a Bunch Of Disks: JBOD should be used to mount storage on the slave nodes for HDFS; RAID should not be used, as fault tolerance is implemented by the replication factor; LVM should be disabled. RAID: a RAID configuration might be considered to store the edit logs and FS image of the name node. SAN (network storage): a SAN might be used for a copy of the edit logs and FS image, but not for HDFS. Virtualization: virtualization should not be used. Disk sizing requirements: one hard drive (1-2 TB), 2 cores, and 6-8 GB RAM work well for most configurations. Disk sizing requirement for HDFS = size of data that needs to be stored * average replication factor; if you want to store 100 TB of data with an average replication factor of 3.5, then 350 TB of storage needs to be provisioned.

114 Network Topologies: understand network usage in Hadoop (for both HDFS and MapReduce) and propose or identify key network design components for a given scenario. Network usage in Hadoop: HDFS – cluster housekeeping traffic (minimal), client metadata operations on the namenode (minimal), block data transfer (can be network intensive, e.g. on disk/node failure); Map Reduce – the shuffle and sort phase between mapper and reducer uses the network. Network design: 1 Gb is cheaper; 10 Gb is expensive, and performance might not benefit much for HDFS and Map Reduce (it might help HBase); fiber optics are not necessary. North/South traffic pattern vs. East/West traffic pattern (Hadoop exhibits East/West). Tree structure vs. Spine Fabric.

115 Network Design – Tree Structure

116 Network Design – Spine Fabric

117 Hadoop Cluster Installation and Administration (25%)
Given a scenario, identify how the cluster will handle disk and machine failures Analyze a logging configuration and logging configuration file format Understand the basics of Hadoop metrics and cluster health monitoring Identify the function and purpose of available tools for cluster monitoring Be able to install all the ecosystem components in CDH 5, including (but not limited to): Impala, Flume, Oozie, Hue, Cloudera Manager, Sqoop, Hive, and Pig Identify the function and purpose of available tools for managing the Apache Hadoop file system

118 Given a scenario, identify how the cluster will handle disk and machine failures
Handling disk and machine failures HDFS Replication factor is used to address both disk and machine failures. In multi rack configuration, it can effectively handle network switch failures as well. Map Reduce MRv1/Classic MRv2/YARN Rack awareness (HDFS and Map Reduce)

119 Rack Awareness

120 Given a scenario, identify how the cluster will handle disk and machine failures
MapReduce v1 (MRv1) – Fault Tolerance. Task failure: a task may fail due to a bug in the mapper/reducer code, bugs in the JVM, or by hanging. The number of task attempts is controlled by mapred.map.max.attempts and mapred.reduce.max.attempts (default 4). If some failures of a job can be tolerated, use mapred.*.max.failures.percent (* => map/reduce). Speculative execution is enabled by default; multiple tasks might process the same data in case of slowness due to failures related to hardware (servers, memory, network, etc.). Task Tracker failure: if there are no heartbeats from a task tracker to the job tracker for 10 minutes, that task tracker is removed from the pool. If there are too many failures (default 4) for a task tracker, it is blacklisted (mapred.max.tracker.blacklists); if there are too many failures (default 4) for a task tracker per job, it is blacklisted (mapred.max.tracker.failures). Job Tracker failure: the Job Tracker is the master for scheduling all jobs and is a single point of failure; no jobs can be run.

121 Given a scenario, identify how the cluster will handle disk and machine failures
MapReduce v2 (MRv2/YARN) – Fault Tolerance. Task failure: mostly the same as classic/MRv1. Application Master failure: if the application master fails, the job fails; this can be controlled by yarn.resourcemanager.am.max-retries (default 1). Node Manager failure: if there are no heartbeats from a Node Manager to the Resource Manager for 10 minutes (default), that node manager is removed from the pool. Resource Manager failure: although the probability of Resource Manager failure is relatively low, no jobs can be submitted until the RM is brought back up and running. High availability can be configured in YARN, which means there can be multiple RMs running in the cluster; there is no high availability in MRv1 (only one job tracker).

122 Analyze a logging configuration and logging configuration file format
Hadoop 1.x Log files are stored locally where map/reduce tasks run In some cases, it used to be tedious to troubleshoot using logs that are scattered across multiple nodes in the cluster Hadoop 2.x Provides additional features to store logs in HDFS. So if any job fails we need not go through multiple nodes to troubleshoot the issue. We can get the details from HDFS.

123 Analyze a logging configuration and logging configuration file format
Default Hadoop logs location: $HADOOP_HOME/logs. In hadoop-env.sh, HADOOP_LOG_DIR has to be set to a different value in production clusters, typically under /var/log/hadoop; CDH5 stores logs under /var/log with a directory for each of the sub-processes. Two log files: *.log and *.out. Log files are rotated daily (depending on the log4j configuration). Out files are rotated on restarts; they typically store information during daemon startup and do not contain much information. Log file naming convention: hadoop-<user-running-hadoop>-<daemon>-<hostname>.{log|out}. Default log level: INFO. The log level can be set for any specific class with log4j.logger.class.name = LEVEL. Valid log levels: FATAL, ERROR, WARN, INFO, DEBUG, TRACE. HDFS: settings in log4j.properties. MRv2/YARN: settings in yarn-site.xml.
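Log levels can also be changed at runtime without restarting a daemon; the host name is a placeholder and port 50070 assumes a default namenode HTTP port:
# Inspect and change a daemon's log level on the fly
hadoop daemonlog -getlevel nnhost:50070 org.apache.hadoop.hdfs.server.namenode.NameNode
hadoop daemonlog -setlevel nnhost:50070 org.apache.hadoop.hdfs.server.namenode.NameNode DEBUG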

124 HDFS – Log Configuration
Update log4j.properties

125 MRv2/YARN – Log Configuration
File Name | Parameter Name | Parameter value | Description
yarn-site.xml | yarn.log-aggregation-enable | true or false | Enable or disable log aggregation
yarn-site.xml | yarn.log-aggregation.retain-seconds | 604800 | How long to retain aggregated logs
yarn-site.xml | yarn.nodemanager.log-dirs | | Local directories where container logs are written
yarn-site.xml | yarn.nodemanager.log.retain-seconds | 10800 | How long to retain container logs locally
yarn-site.xml | yarn.nodemanager.remote-app-log-dir | | HDFS directory where aggregated logs are stored
yarn-site.xml | yarn.nodemanager.remote-app-log-dir-suffix | | Suffix appended to the remote log directory
log4j.properties | | | Log levels and appenders for the daemons

126 Understand the basics of Hadoop metrics and cluster health monitoring
jvm dfs mapred rpc Source and sink Metrics will be collected from various sources (daemons) and pushed to sink (eg: Ganglia). Rules can be defined to filter out the metrics that are not required by the sinks. Sample hadoop-metrics2.properties # hadoop-metrics2.properties # By default, send metrics from all sources to the sink # named 'file', using the implementation class FileSink. *.sink.file.class = org.apache.hadoop.metrics2.sink.FileSink # Override the parameter 'filename' in 'file' for the namenode. namenode.sink.file.filename = namenode-metrics.log # Send the jobtracker metrics into a separate file. jobtracker.sink.file.filename = jobtracker-metrics.log

127 Understand the basics of Hadoop metrics and cluster health monitoring
Cluster Health Monitoring examples Monitoring hadoop daemons Alert if daemon goes down Monitoring disks Alert immediately if disk fails Warn when usage on disk reaches 80% Critical alert when usage on disk reaches 90% Monitoring CPU on master nodes Alert excessive CPU usage on masters Excessive CPU usage on slaves is typical Monitor swap usage on all nodes Alert if swap partition is used Monitor network transfer speeds Monitor checkpoints by secondary namenode Age of fsimage Size of edit logs

128 Understand the basics of Hadoop metrics and cluster health monitoring
Thresholds are fluid: start conservative and change them over time, and avoid unnecessary alerting. Alerting can be set at the host level, overall, HDFS specific, and Map Reduce specific: host level checks, overall Hadoop checks, HDFS checks, Map Reduce checks. CDH5 monitoring and alerts: see the Cloudera Manager Diagnostics Guide (cm5dg_monitoring_settings.html).

129 Identify the function and purpose of available tools for cluster monitoring
Ganglia Nagios Cacti Hyperic Zabbix Cloudera Manager Ambari Many more

130 Be able to install all the ecosystem components in CDH 5, including (but not limited to): Impala, Flume, Oozie, Hue, Cloudera Manager, Sqoop, Hive, and Pig Cloudera Manager Hive Impala Flume Sqoop Pig Zookeeper HBase Oozie Hue

131 Cloudera Manager

132 Hive (Architecture)

133 Hive Architecture Dependencies Daemon Processes
HDFS Map Reduce Relational DB for Metastore (MySQL or PostgreSQL) Daemon Processes No additional daemon processes on Hadoop cluster Zookeeper Hive Server Hive Metastore Configuration using CDH5 Need to configure Hive Metastore (after relational database is already installed) Validation Logs Running simple queries

134 Impala Architecture Dependencies Daemon Processes
HDFS only (does not require map reduce) Metastore (Hive) Daemon Processes: Impalad, Statestore server, Catalog server Configuration using CDH5 Validation Logs Running simple queries

135 Flume Architecture Dependencies Daemon Processes
HDFS Map Reduce Daemon Processes Configuration using CDH5 Validation

136 Sqoop (Architecture)

137 Sqoop Architecture

138 Sqoop2 (Architecture)

139 Sqoop Architecture Dependencies Daemon Processes
HDFS Map Reduce Daemon Processes Configuration using CDH5 Validation

140 Pig Architecture Dependencies Daemon Processes
Install pig binaries on the gateway node Dependencies HDFS Map Reduce Daemon Processes None Configuration using CDH5 Validation

141 Zookeeper Architecture Dependencies Daemon Processes
Configuration using CDH5 Validation

142 HBase (Architecture)
(Diagram: HBase Masters, HBase Region Servers, and Zookeeper running alongside the Namenode, Resource Manager, Datanodes, and Node Managers)

143 HBase Architecture Dependencies Daemon Processes
HDFS Zookeeper Daemon Processes Masters (at least 3) Region Servers Configuration using CDH5 Validation

144 Oozie Architecture Dependencies Daemon Processes
HDFS Map Reduce Require other components to run different workflows (for eg: Hive, Pig etc) Daemon Processes Oozie Server Configuration using CDH5 Validation

145 Hue Architecture Dependencies Daemon Processes
All Hadoop eco system tools that needs to be accessed using Hue UI Daemon Processes Configuration using CDH5 Validation

146 Identify the function and purpose of available tools for managing the Apache Hadoop file system
HDFS Federation Load balancing Namenodes HDFS HA (Active/Passive) Transparent fail over using journal nodes Namenode UI (default port: 50070) Datanode UI (default port: 50075) hadoop fs (command line utility) hdfs (command line utility)
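Beyond the web UIs, a few command-line tools for managing the file system; the threshold and paths are illustrative:
# Rebalance blocks across datanodes until usage is within 10% of the cluster average
hdfs balancer -threshold 10
# Space used per directory, human readable
hdfs dfs -du -h /user
# Re-read the datanode include/exclude files (used when decommissioning nodes)
hdfs dfsadmin -refreshNodes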

147 Resource Management (10%)
Understand the overall design goals of each of Hadoop schedulers Given a scenario, determine how the FIFO Scheduler allocates cluster resources Given a scenario, determine how the Fair Scheduler allocates cluster resources under YARN Given a scenario, determine how the Capacity Scheduler allocates cluster resources

148 Understand the overall design goals of each of Hadoop schedulers
FIFO Scheduler – First In First Out; the default scheduler; not suitable for production deployments. Fair Scheduler – shares cluster resources fairly among running applications and pools/queues over time. Capacity Scheduler – divides the cluster into queues, each with a guaranteed share of capacity.

149 Given a scenario, determine how the FIFO Scheduler allocates cluster resources
FIFO – First in First out

150 Given a scenario, determine how the Fair Scheduler allocates cluster resources under YARN

151 Given a scenario, determine how the Capacity Scheduler allocates cluster resources

152 Monitoring and Logging (15%)
Understand the functions and features of Hadoop’s metric collection abilities Analyze the NameNode and JobTracker Web UIs Understand how to monitor cluster Daemons Identify and monitor CPU usage on master nodes Describe how to monitor swap and memory allocation on all nodes Identify how to view and manage Hadoop’s log files Interpret a log file

153 Understand the functions and features of Hadoop’s metric collection abilities
Hadoop Metrics jvm dfs mapred rpc Source and sink Metrics will be collected from various sources (daemons) and pushed to sink (eg: Ganglia). Rules can be defined to filter out the metrics that are not required by the sinks. Sample hadoop-metrics2.properties # hadoop-metrics2.properties # By default, send metrics from all sources to the sink # named 'file', using the implementation class FileSink. *.sink.file.class = org.apache.hadoop.metrics2.sink.FileSink # Override the parameter 'filename' in 'file' for the namenode. namenode.sink.file.filename = namenode-metrics.log # Send the jobtracker metrics into a separate file. jobtracker.sink.file.filename = jobtracker-metrics.log

154 Analyze the NameNode and JobTracker Web UIs
Namenode Web UI Architecture recap JobTracker Web UI ResourceManager Web UI Resource Manager Application Master Job History Server

155 Understand how to monitor cluster Daemons
HDFS proc_namenode proc_secondarynamenode proc_datanode Map Reduce (MRv1/classic) proc_jobtracker proc_tasktracker Map Reduce (MRv2/YARN) proc_resourcemanager proc_nodemanager ps command (ps -fu, ps -ef) Service command Cloudera Manager
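A sketch of checking the daemons from the shell; the CDH5 service name is an assumption for package-based installs:
# List Java processes (NameNode, DataNode, ResourceManager, ...)
sudo jps
# Raw process listing for a specific daemon
ps -ef | grep -i datanode
# Service status on a CDH5 package-based install
sudo service hadoop-hdfs-datanode status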

156 Identify and monitor CPU usage on master nodes
Uptime Top Cpustat Cloudera Manager

157 Describe how to monitor swap and memory allocation on all nodes
Top command Cloudera Manager

158 Identify how to view and manage Hadoop’s log files
HDFS Command Line Web UI MapReduce

159 Interpret a log file Using Web UI navigate through logs and interpret the information to monitor the cluster, running jobs and troubleshoot any issues.

160 Miscellaneous Compression

