1 Baodong Jia, Tomasz Wiktor Wlodarczyk, Chunming Rong, Department of Electrical Engineering and Computer Science, University of Stavanger. Presented by Namrata Patil.

2 Contents
- Introduction
- Sub-projects of Hadoop
- Two solutions for data acquisition
- Workflow of the Chukwa system
- Primary components
- Setup for performance analysis
- Factors influencing the performance comparison
- Conclusion

3 Introduction
- Oil and gas industry.
- Drilling is carried out by service companies.

4 Continued…
Service companies collect drilling data by placing sensors on drill bits and platforms, and make the data available on their servers.
- Advantages: operators can follow the drilling status and get useful information from the historical data.
- Problem: vast amounts of data are accumulated, and it is infeasible or very time consuming to perform reasoning over them.
- Solution: investigate the application of a MapReduce system, Hadoop.

5 Sub-projects of Hadoop
1. Hadoop Common
2. Chukwa
3. HBase
4. HDFS
HDFS: a distributed file system that stores application data in a replicated way and provides high throughput.
Chukwa: an open-source data collection system designed for monitoring large distributed systems.

6 Two solutions for data acquisition
- Solution 1: acquire data from the data sources, then copy the data files to HDFS.
- Solution 2: a Chukwa-based solution.

7 Solution 1
Hadoop runs MapReduce jobs on the cluster and stores the results on HDFS.
Steps (a minimal driver sketch follows below):
1. Prepare the required data set for the job.
2. Copy it to HDFS.
3. Submit the job to Hadoop.
4. Store the result in a user-specified directory on HDFS.
5. Get the result out of HDFS.
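The steps above can be expressed with the standard Hadoop client API. This is a minimal sketch, assuming the Hadoop 2.x org.apache.hadoop.mapreduce API; the local and HDFS paths, the job name, and the omitted mapper/reducer classes are hypothetical and not taken from the paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Solution1Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Step 2: copy the prepared local data set into HDFS (paths are hypothetical).
        Path localData = new Path("file:///data/drilling/input.csv");
        Path hdfsInput = new Path("/user/drilling/input");
        Path hdfsOutput = new Path("/user/drilling/output");
        fs.copyFromLocalFile(localData, hdfsInput);

        // Step 3: submit the MapReduce job (mapper and reducer classes omitted in this sketch).
        Job job = Job.getInstance(conf, "drilling-data-analysis");
        job.setJarByClass(Solution1Driver.class);
        FileInputFormat.addInputPath(job, hdfsInput);
        FileOutputFormat.setOutputPath(job, hdfsOutput);   // Step 4: results land in this HDFS directory
        job.waitForCompletion(true);

        // Step 5: copy the results back out of HDFS.
        fs.copyToLocalFile(hdfsOutput, new Path("file:///data/drilling/result"));
    }
}
```

The per-file copy in step 2 is where the extra time for many small files comes from, as the pros-and-cons slide below points out.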

8 Pros & Cons… Pros… Works efficiently for small number of files with large file size Cons… Takes a lot of extra time for large number of files with small file size Does not support appending file content Department of computer science & Engg8

9 Solution 2
- Overcomes the problem of the extra time generated by copying large files to HDFS.
- Sits on top of Hadoop.
- Chukwa feeds the organized data into the cluster.
- Uses temporary files to store the data collected from the different agents.

10 Chukwa
- An open-source data collection system built on top of Hadoop.
- Inherits Hadoop's scalability and robustness.
- Provides a flexible and powerful toolkit to display, monitor, and analyze results.

11 Workflow of the Chukwa system
(Diagram slide: the Chukwa data flow from agents through collectors to HDFS, MapReduce processing, and HICC.)

12 Primary components
- Agents: run on each machine and emit data.
- Collectors: receive data from the agents and write it to stable storage.
- MapReduce jobs: parse and archive the data.
- HICC, the Hadoop Infrastructure Care Center: a web-portal-style interface for displaying data.

13 Continued…
- Agents collect data through their adaptors.
- Adaptors are small, dynamically controllable modules that run inside the agent process (an illustrative file-tailing sketch follows below).
- There are several adaptors.
- Agents run on every node of the Hadoop cluster.
- Different hosts may generate different data.
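To make the adaptor idea concrete, here is an illustrative, self-contained file-tailing module in plain Java. It is not Chukwa's actual adaptor interface; the class name, the polling loop, and the emit() hook are assumptions used only to show the tail-from-a-remembered-offset behaviour.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Illustrative stand-in for a file-tailing adaptor: remember an offset, wake up
// periodically, and emit whatever has been appended to the log since the last check.
// Chukwa's real adaptors implement a Chukwa-specific interface; this only mimics the idea.
public class FileTailerSketch implements Runnable {
    private final String path;
    private long offset = 0;

    public FileTailerSketch(String path) { this.path = path; }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                long length = file.length();
                if (length > offset) {
                    byte[] chunk = new byte[(int) (length - offset)];
                    file.seek(offset);
                    file.readFully(chunk);
                    offset = length;
                    emit(new String(chunk, StandardCharsets.UTF_8));
                }
                Thread.sleep(2000);   // the paper's setup checks the file content every 2 seconds
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    // In Chukwa the new data would be handed to the agent, which forwards it to a collector over HTTP.
    private void emit(String data) {
        System.out.println("emitting " + data.length() + " chars of new data");
    }
}
```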

14 Collectors
- Gather the data through HTTP.
- Each collector receives data from up to several hundred agents.
- Writes all this data to a single Hadoop sequence file, called a sink file.
- Collectors periodically close their sink files, rename them to mark them available for processing, and resume writing to a new file.
Advantages (a sink-file writing sketch follows below):
- Reduces the number of HDFS files generated by Chukwa.
- Hides the details of the HDFS file system in use, such as its Hadoop version, from the adaptors.
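A sink file is an ordinary Hadoop sequence file, so the collector's write path can be sketched with the stock SequenceFile API. This is a minimal sketch: the sink path is hypothetical, and generic LongWritable/Text records stand in for Chukwa's own key and chunk classes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SinkFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path sinkFile = new Path("/chukwa/logs/collector-0.chukwa");   // hypothetical sink path

        // Append key/value records to one growing sequence file; the collector later closes
        // it and renames it so the MapReduce jobs can pick it up for processing.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(sinkFile),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            writer.append(new LongWritable(System.currentTimeMillis()),
                          new Text("chunk of drilling sensor data from one agent"));
        }
    }
}
```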

15 MapReduce processing
Aim: organizing and processing the incoming data.
MapReduce jobs (a simplified demux-style mapper is sketched below):
- Archiving: takes chunks from its input and outputs new sequence files of chunks, ordered and grouped.
- Demux: takes chunks as input and parses them to produce ChukwaRecords (key-value pairs).
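The demux step can be illustrated with a stripped-down mapper. This sketch assumes a plain text input where each line stands in for a chunk in a "dataType<TAB>payload" layout, which is an assumption and not Chukwa's actual format; the real Demux job works on chunk objects and emits ChukwaRecordKey/ChukwaRecord pairs, so treat this only as the shape of the parsing step.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Simplified demux-style mapper: parse each incoming line into a (record key, record body)
// pair. Text is used for both sides only to keep the sketch self-contained.
public class DemuxSketchMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text chunk, Context context)
            throws IOException, InterruptedException {
        String[] parts = chunk.toString().split("\t", 2);
        if (parts.length == 2) {
            context.write(new Text(parts[0]), new Text(parts[1]));
        }
    }
}
```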

16 HICC, the Hadoop Infrastructure Care Center
- A web interface for displaying data.
- Fetches the data from a MySQL database (a query sketch follows below).
- Makes it easier to monitor the data.
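Fetching the displayed metrics is plain JDBC against the MySQL database. The connection URL, credentials, and the cluster_metrics table and its columns below are hypothetical placeholders rather than HICC's actual schema, and a MySQL JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HiccQuerySketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and table layout, used only to show the fetch pattern.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/chukwa", "chukwa", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                "SELECT host, metric, value, ts FROM cluster_metrics WHERE ts > ? ORDER BY ts")) {
            stmt.setLong(1, System.currentTimeMillis() - 3600_000L);   // metrics from the last hour
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s %s=%s at %d%n",
                            rs.getString("host"), rs.getString("metric"),
                            rs.getString("value"), rs.getLong("ts"));
                }
            }
        }
    }
}
```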

17 Setup for Performance Analysis
- A Hadoop cluster consisting of 15 Unix hosts in the Unix lab at UiS.
- One host is tagged as the name node; the others are used as data nodes.
- Data is stored on the data nodes in a replicated way (see the replication sketch below).
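The replica number varied in the later slides maps to HDFS's dfs.replication setting. A minimal sketch of controlling it from a Java client is shown below; in practice the cluster-wide default lives in hdfs-site.xml, and the example path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        // Set the default replication factor on this client's configuration
        // (the experiments compare replica numbers 2 and 3).
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(conf);

        // Replication can also be changed per existing file after the fact.
        fs.setReplication(new Path("/user/drilling/input"), (short) 2);
    }
}
```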

18 Factors Influencing the Performance Comparison
1. Quality of the data acquired in different ways
2. Time used for data acquisition for small data sizes
3. Data copying to HDFS for big data sizes

19 Quality of the Data Acquired in Different Ways
(Figure: the size of data acquired over time.)
- Sink file size = 1 GB.
- The Chukwa agent checks the file content every 2 seconds.

20 Time Used for Data Acquisition for Small Data Size
(Figure: actual time used for acquisition over a certain period.)
- Time used to acquire data from the servers.
- Time used to put the acquired data into HDFS.

21 Data Copying to HDFS for Big Data Size
(Figure: time used to copy the data set to HDFS with different replica numbers.)
- The slope of the line is steeper when the replica number is bigger.

22 Critical Value of Generating Time Differences
Time used for copying according to the size of the data set, with a replica number of 2:

Size of data set | Time used
20 MB            | 2 s
30 MB            | 3 s
40 MB            | 3 s
50 MB            | 8 s

The critical value is the size of the data file at which a time difference in data acquisition starts to appear.

23 Continued…
Time used for copying according to the size of the data set, with a replica number of 3:

Size of data set | Time used
10 MB            | 2 s
15 MB            | 2 s
20 MB            | 8 s
30 MB            | 10 s
40 MB            | 21 s

24 Conclusion
Chukwa was demonstrated to work more efficiently for big data sizes, while for small data sizes there was no difference between the two solutions.


