Presentation on theme: "Hadoop Hardware Infrastructure considerations ©2013 OpalSoft Big Data."— Presentation transcript:

1 Hadoop Hardware Infrastructure considerations ©2013 OpalSoft Big Data

2 Hardware Considerations Original design idea: Hadoop was designed and developed to run on ordinary commodity hardware. Actual production environment: large production Hadoop clusters do require careful hardware infrastructure planning to achieve minimal latency when storing and processing large volumes of data. The hardware architecture should be designed around the nature of the data, the jobs that will be run, and the agreed SLA.

3 Hardware Considerations Aspects of Hadoop hardware infrastructure covered here: servers (NameNode, JobTracker, DataNode), racks, network switches, storage, backup, and the number of copies of data.

4 NameNode & JobTracker Server RAM should be sized based on the following: the number of DataNodes in the cluster, the approximate number of blocks that will be stored in the cluster, and the number of different Hadoop processes that run on the machine.
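To make the RAM sizing concrete, here is a minimal sketch based on a common rule of thumb that each namespace object (file, directory, or block) costs on the order of 150 bytes of NameNode heap. The 150-byte figure and the 2x headroom factor are illustrative assumptions, not exact numbers; treat the result as a starting point only.

```python
# Rough NameNode heap estimate. BYTES_PER_OBJECT is an assumed average
# heap cost per namespace object (file, directory, block).
BYTES_PER_OBJECT = 150

def estimate_namenode_heap_gb(num_files, avg_blocks_per_file, headroom=2.0):
    """Estimate NameNode heap in GB for a given namespace size."""
    objects = num_files * (1 + avg_blocks_per_file)  # each file plus its blocks
    raw_bytes = objects * BYTES_PER_OBJECT
    return raw_bytes * headroom / (1024 ** 3)

# Example: 10 million files averaging 1.5 blocks each -> roughly 7 GB heap
print(round(estimate_namenode_heap_gb(10_000_000, 1.5), 2))
```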

5 I/O adapters: not a critical element, as the NameNode does not participate in data transfers. Processor: minimally needs a multi-core processor. Standby node server: should have the same capacity as the primary NameNode.

6 DataNode Servers RAM should be sized based on the following: the approximate number of blocks that will be stored in the cluster and the number of different Hadoop processes that run on the machine.

7 I/O adapters: high-throughput I/O adapters are needed. Processor: multi-core, multi-processor machines are needed for parallel execution of more than one MapReduce task. Virtualization is not recommended.

8 Racks Hadoop is rack-aware: configure Hadoop with each node's rack information. Servers should be distributed across at least two racks to prevent data loss due to a rack failure.

9 Hadoop automatically replicates blocks across servers in two different racks. Servers located in the same rack have lower data-transfer latency, because all transfers occur via the rack's network switch.
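Rack information is typically supplied to Hadoop by pointing the net.topology.script.file.name property at a script that maps host names or IPs to rack paths: Hadoop invokes the script with hosts as arguments and reads one rack path per line from stdout. The sketch below assumes a hostname convention that encodes the rack number (e.g. "dn-rack1-03"), which is a made-up convention for illustration; real clusters often resolve racks from an inventory file instead.

```python
#!/usr/bin/env python3
# Minimal sketch of a Hadoop rack-topology script. The "rackN" hostname
# convention is an assumption; adapt rack_for() to your own naming or
# inventory source.
import re
import sys

DEFAULT_RACK = "/default-rack"

def rack_for(host):
    """Map a host name or IP to a rack path; fall back to the default rack."""
    m = re.search(r"rack(\d+)", host)
    return f"/rack{m.group(1)}" if m else DEFAULT_RACK

if __name__ == "__main__":
    # Hadoop passes one or more hosts and expects one rack path per line.
    for host in sys.argv[1:]:
        print(rack_for(host))
```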

10 Network Switch It is recommended to have a separate private network for the Hadoop cluster. Both core and rack network switches should support high-bandwidth, full-duplex data transfer. Higher-capacity core and rack switches will be required if the number of data copies is more than the standard 3.
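To see why more copies demand higher-capacity switches, consider the HDFS write pipeline: each replica beyond the first forwards the block across the network once. The sketch below is a simplified model (it ignores which hops cross racks, which depends on replica placement), meant only to show how traffic scales with the replication factor.

```python
# Back-of-envelope network traffic for writing data with N replicas.
# Simplified model: the first replica costs no network hop, and each
# additional replica moves the data across the network once.
def write_traffic_gb(data_gb, replication=3):
    """Total network GB moved to write data_gb with the given replication."""
    return data_gb * (replication - 1)

print(write_traffic_gb(100, 3))  # 200 GB moved to write 100 GB at 3 copies
print(write_traffic_gb(100, 5))  # 400 GB at 5 copies: heavier switch load
```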

11 Storage Locally attached storage provides better performance than NFS or SAN storage. Hard disks with higher RPM provide better read/write throughput. Using a larger number of smaller-capacity hard disks instead of a single large-capacity disk allows concurrent reads/writes and reduces disk-level bottlenecks.
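The many-small-disks point can be sketched with simple arithmetic: sequential throughput adds up across independent spindles, so several smaller disks of the same total capacity sustain far more concurrent read/write bandwidth than one large disk. The per-disk MB/s figure below is an illustrative assumption.

```python
# Aggregate throughput across independent disks. The 160 MB/s per-disk
# figure is an assumed ballpark for a 7200 RPM SATA drive.
def aggregate_throughput(disks, mb_per_sec_each):
    """Peak combined sequential throughput across independent spindles."""
    return disks * mb_per_sec_each

one_big = aggregate_throughput(1, 160)     # e.g. a single 12 TB disk
many_small = aggregate_throughput(6, 160)  # e.g. six 2 TB disks, same capacity
print(one_big, many_small)  # 160 vs 960 MB/s of concurrent read/write
```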

12 NameNode and JobTracker servers should use a highly fault-tolerant RAID configuration. DataNode RAID configuration is not as critical, since the data is already replicated across multiple servers. Using SSDs will improve performance drastically at the expense of higher setup cost.

13 Backup NameNode data is the most critical information and needs the most frequent backup. NameNode data is regularly streamed to a standby node so that Hadoop cluster operation can be restored if the primary NameNode fails.

14 An additional backup server is recommended to perform periodic backup and checksum verification of NameNode data. Whether DataNodes need backup depends on the criticality and availability requirements of the data; they do not require frequent backup, and a regular backup is needed only to recover from data-center-level failures.

15 Number of Copies of Data The following determine the number of copies of a block: the criticality of the data and the number of concurrent MapReduce jobs that will be executed on the data set. More replicas allow more jobs to run concurrently on the same data set and reduce job execution time, since most often all the required data is available either locally or at least in the same rack.

16 Be aware that more replicas reduce the write performance of the Hadoop cluster. Having more replicas only for the most frequently used data provides the maximum benefit, rather than raising the general replication factor across the cluster.
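The trade-off above can be sketched as storage arithmetic: raising the replication factor cluster-wide multiplies all storage, while raising it only on the hot working set costs far less. The 100 TB total and the 10% hot-data fraction below are illustrative assumptions.

```python
# Storage cost of a blanket replication factor versus raising replicas
# only on frequently used data. All sizes here are illustrative.
def storage_needed_tb(data_tb, replication):
    """Raw storage consumed by data_tb replicated `replication` times."""
    return data_tb * replication

total = 100.0  # raw data in TB (assumed)
blanket5 = storage_needed_tb(total, 5)  # replication 5 everywhere
targeted = (storage_needed_tb(total * 0.9, 3)    # cold 90% at the default 3
            + storage_needed_tb(total * 0.1, 5))  # hot 10% at 5
print(blanket5, targeted)  # 500.0 vs 320.0 TB
```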

