
1 Hadoop 오지영 (2012.11.16)

2 Hadoop(1)
Distributed file systems
  Processing and management of large amounts of data
  Lengthy, complex computational tasks
  Multiple computers connected over a network
Hadoop
  Based on GFS (Google File System) and MapReduce
  Developer: Doug Cutting, originally as part of Nutch (an open-source search engine)

3 Hadoop(2)
Major components
HDFS (Hadoop Distributed File System)
  Master: Name node -> metadata management
  Slave: Data node -> stores large amounts of data
  Data: split into blocks (default 64 MB)
MapReduce
  Master: JobTracker -> monitoring of TaskTrackers
  Slave: TaskTracker -> parallel processing of data
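
A small sketch, assuming the Hadoop 1.x Java client API and the 1.x property name dfs.block.size, of how the default 64 MB block size can be inspected or overridden for newly written files:

import org.apache.hadoop.conf.Configuration;

public class BlockSizeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();            // loads core-site.xml / hdfs-site.xml
        // Hadoop 1.x property name; later versions use dfs.blocksize instead
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);  // request 128 MB blocks for new files
        System.out.println("block size = "
                + conf.getLong("dfs.block.size", 64L * 1024 * 1024));
    }
}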

4 Hadoop(3)

5 Hadoop(4)
Characteristics
Scale-out system
  Total capacity and performance increase linearly
Easy to add and remove nodes
  Node status is monitored through periodic communication -> machines can be added and removed freely
High availability
  Failure of some machines -> no impact on the whole system
Single Point Of Failure (SPOF)
  Vulnerable to failures of the JobTracker and Name node

6 HDFS
HDFS (Hadoop Distributed File System)
  Similar to a general file system
  Tree-type directory structure
  Replication, deletion, and moving of directories and files (see the API sketch below)
Purpose of HDFS
  Maintaining high availability for large amounts of data
  Specially designed for managing large data
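
A minimal sketch of these directory and file operations through the Hadoop FileSystem Java API (the paths are hypothetical, for illustration only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDirExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);                 // connects to the configured HDFS
        Path dir = new Path("/user/demo/logs");               // hypothetical directory
        fs.mkdirs(dir);                                       // create a directory in the tree-type namespace
        fs.rename(dir, new Path("/user/demo/old-logs"));      // move, like a normal file system
        fs.delete(new Path("/user/demo/old-logs"), true);     // recursive delete
        fs.close();
    }
}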

7 HDFS characteristic(1)
Data Block
Name node
  Basic information about each file (name, directory, permissions…)
  Management of block location information
Data node
  Stores the partitioned blocks of each file
Replication
  Prevents loss of data
  Several replicas of each data block (default: three replicas)
  Heartbeat: packets exchanged between the Name node and Data nodes
Rack Awareness
  For enhanced availability when a problem occurs on a rack or in a data center
  Data blocks are placed based on the topological structure of the whole cluster
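
A short sketch, using the Hadoop Java client API, of setting the default replication factor and raising it for one existing (hypothetical) file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);                    // default replica count for new files
        FileSystem fs = FileSystem.get(conf);
        // raise the replica count of an existing (hypothetical) file to 5
        fs.setReplication(new Path("/user/demo/data.txt"), (short) 5);
        fs.close();
    }
}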

8 HDFS characteristic(2) Data Read(Locality)

9 HDFS characteristic(3) Data Write(Consistency)

10 MapReduce(1)
MapReduce
  Processes list data, as in functional languages
Map function -> map task (Hadoop)
  Performs a common operation on each element of the list
  Key/value pairs are extracted from the data
Reduce function -> reduce task (Hadoop)
  Combines all the elements of the list
  Produces one output result
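
A minimal word-count style sketch of a map function in the Hadoop Java API (class and identifier names are illustrative, not from the slides): each input line is split into words, and a key/value pair (word, 1) is emitted per word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map function: emits one key/value pair per word found in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // key (word) / value (count of 1)
            }
        }
    }
}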

11 MapReduce structure and processing mode(1)
Job and Task
Job
  Unit for managing the work requested by the user
  Processed in a distributed manner across several nodes
Task
  Detailed unit of work processed at each node
  Number of map tasks: depends on the size of the input data
  Split: the unit of data processed by one map task
  Number of reduce tasks: configured directly by the user
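
A brief sketch, using the Hadoop Java API, of how the reduce-task count is set explicitly while the map-task count follows from the input splits (the job name and count are arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceTaskCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");   // Job constructor as used in the Hadoop 1.x API
        // The number of map tasks is derived from the input splits and is not set directly;
        // the number of reduce tasks is chosen explicitly by the user:
        job.setNumReduceTasks(4);
    }
}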

12 MapReduce structure and processing mode(2)
JobTracker and TaskTracker
JobTracker
  Master of the MapReduce system
  Only one runs in the entire system
  Jobs requested by the user are inserted into a queue
  Monitors the current status of each TaskTracker
  Assigns tasks to TaskTrackers as they need to be performed
TaskTracker
  Manages and executes the assigned tasks
  One TaskTracker runs on each machine
  Each TaskTracker performs multiple tasks
  Number of tasks: specified by the user in the Hadoop configuration file (see the sketch below)
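
A sketch of the per-node configuration entries meant here, assuming the Hadoop 1.x property names in mapred-site.xml (the slot counts are illustrative):

<!-- mapred-site.xml on each TaskTracker node (Hadoop 1.x property names) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>    <!-- map task slots per TaskTracker -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>    <!-- reduce task slots per TaskTracker -->
</property>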

13 MapReduce structure and processing mode(3)

14 MapReduce structure and processing mode(4)
MapTask and ReduceTask
MapTask
  Performs the job one record at a time
ReduceTask
  Merges and processes related data
Classification of data
  Map task -> data classification -> reduce task
  Generated key/value pairs are classified by key
  Data with the same key is sent to the same reduce task
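
A matching reduce-function sketch (Hadoop Java API, illustrative names): all values classified under the same key arrive together and are merged into one output record.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce function: merges all counts that share the same key (word) into one total.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}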

15 MapReduce structure and processing mode(5)

16 Job Flow

17 MapReduce policy and function(1)
Task allocation policy
JobTracker
  Assigns map tasks to TaskTrackers
  Assigns each task so that the block data is handled on the node (Data node) that already stores it
  (fast, because the data does not have to be fetched over the network)
  Management of split file information
    The location of the nodes that contain the data
    Assigning the task to the nearest data
Reduce task allocation
  Manages the list of jobs being performed
  Reduce tasks are partitioned across multiple TaskTrackers
Black list
  Manages nodes whose tasks often fail

18 MapReduce policy and function(2)
Combiner
  Map task: local
  Reduce task: non-local
  If the result of a map task is large, network cost increases
Combiner is located between the mapper and the reducer
  A local reducer
  Reduces network cost
  Results of the map task -> classified according to key -> the combiner performs the reduce step -> the result is sent to the reducer
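
A sketch of wiring a combiner into a job with the Hadoop Java API; here the word-count reducer from the earlier sketch doubles as the local reducer, which works because summation can be applied partially on each node:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count with combiner");
        job.setMapperClass(WordCountMapper.class);
        // The reducer doubles as a "local reducer": map output is pre-aggregated on each node
        // before it crosses the network, cutting shuffle cost.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}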

19

20 MapReduce policy and function(3)
Incremental Shuffling
Shuffle
  Passes the result values from map tasks to reduce tasks
  One of the most costly parts of a job
Solution (to reduce the amount of time)
  JobTracker: pre-assigns reduce tasks
  Reduce task: as soon as a map task finishes, the shuffle for that map task's output is performed
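
One related, user-visible knob is the reduce slow-start setting; a sketch assuming the Hadoop 1.x property name mapred.reduce.slowstart.completed.maps:

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Hadoop 1.x property: fraction of map tasks that must finish before
        // reduce tasks are launched and begin pulling (shuffling) map output.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.05f);
    }
}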

21 MapReduce policy and function(4)
Speculative Execution
  A problem occurs on the machine where a particular task is running
  The task slows down or fails
  As a result, the whole job is delayed
Solution
  The JobTracker continuously monitors the TaskTrackers
  Slow tasks are additionally assigned to other TaskTrackers
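
A sketch of the per-job switches, assuming the Hadoop 1.x property names for speculative execution:

import org.apache.hadoop.conf.Configuration;

public class SpeculativeExecution {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Hadoop 1.x property names; speculative execution is on by default, so a slow
        // ("straggler") task is re-launched on another TaskTracker and the first copy to finish wins.
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
    }
}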

22 Closing remarks
Hadoop is currently the most widely used open-source distributed system
It processes large amounts of work through the API called MapReduce

23 Disadvantages and limitations of hadoop(1)
Scale-out, not scale-up
Scale-up (scaling vertically)
  Raise the performance of the system by increasing the performance of each machine
Scale-out (scaling horizontally)
  Raise the performance of the system by increasing the number of machines
There are constraints from the performance of the Name node and JobTracker
  As the number of Data nodes increases, the Name node's memory usage increases

24 Disadvantages and limitations of hadoop(2)
It is not cheap (1)
Hadoop is a system that achieves high performance and stability using general-purpose computers
  A general-purpose computer is not a low-end computer
JobTracker or Name node
  Needs more memory than a TaskTracker or Data node
  Needs little local storage

25 Disadvantages and limitations of hadoop(3)
It is not cheap (2)
Example node specification from Hadoop: The Definitive Guide
  Processor: 2 x quad-core Intel Xeon 2.0 GHz CPUs
  Memory: 8 GB ECC RAM
  Storage: 4 TB SATA disks
  Network: Gigabit Ethernet

26 Disadvantages and limitations of hadoop(4)
Stability problems
  The system is not completely safe
  Hadoop is vulnerable to failures of the Name node or JobTracker
  When a node fails, the performance of the entire system is temporarily degraded
  This may cause a domino effect or a ping-pong phenomenon

27 Disadvantages and limitations of hadoop(5)
Resource management
  If data is not managed, it places a burden on the master
  If jobs are not optimized, computing power is wasted
  Requires an understanding of the internal logic of the system

28 Disadvantages and limitations of hadoop(6)
Monitoring and management
  Provides a monitoring web page and commands
  Provides statistical functions
  Low operating cost for job management
  Largely unaffected by operational errors

29 Disadvantages and limitations of hadoop(7)
Many jobs do not fit MapReduce
Purpose of MapReduce
  Suitable for batch jobs over large amounts of data
  e.g., indexing large volumes of web documents for a search service
Efficiency drops for complex tasks
  Real-time data analysis, such as streaming jobs
  Large-scale input data with many map/reduce steps
Solution
  A DB or NoSQL system

30 Disadvantages and limitations of hadoop(8)
MapReduce in Hadoop has become a universal platform for data analysis (1)
Various libraries and systems have been developed on top of Hadoop
Libraries and systems based on MapReduce
  Cascading
    A MapReduce wrapping library
  Pig (Yahoo)
    A subproject of Hadoop
    A script-style language for developing MapReduce programs
    Faster programming and feedback

31 Disadvantages and limitations of hadoop(9)
MapReduce in Hadoop has become a universal platform for data analysis (2)
Libraries and systems based on MapReduce
  Hive (Facebook)
    A system for MapReduce programming
    Provides an SQL-like syntax for data analysts
  Mahout
    A library for the MapReduce programming model
    Applies data mining and machine learning algorithms to large amounts of data

