Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.

1 Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel

2 Objective: Understand how Pig Latin works; implement an IP address filter using Apache Pig; implement a similar IP filter using pure Hadoop; compare and analyze the two implementations; conduct a case study of the pros and cons of Pig relative to other high-level languages.

3 What is Apache Pig? A platform for analyzing large data sets: merging data sets, filtering them, and applying functions to records or groups of records. It also allows you to create user-defined functions (UDFs).

4 Pig Infrastructure: Consists mainly of two layers. The infrastructure layer is a compiler that produces sequences of Map-Reduce programs. The language layer currently consists of a textual language called Pig Latin.

5 Nested Data Model: Pig Latin has a fully nestable data model with atomic values, tuples, bags (lists), and maps. It is more natural to programmers than flat tuples and avoids expensive joins. Example (from diagram): Computers: Desktops, Laptops, Netbooks.
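The nesting described on this slide can be mimicked with ordinary Python values. This is only a rough sketch: the category names come from the slide's diagram, while the counts and the `total_items` helper are made up for illustration and are not part of Pig.

```python
# Plain-Python stand-ins for Pig Latin's kinds of values (illustrative only).
atom = "Computers"                                        # atomic value
tup = ("Desktops", 3)                                     # tuple: ordered fields
bag = [("Desktops", 3), ("Laptops", 5), ("Netbooks", 2)]  # bag: collection of tuples
record = {"category": atom, "items": bag}                 # map: key -> value

def total_items(rec):
    """Sum the (hypothetical) counts stored in the nested bag of one record."""
    return sum(count for _name, count in rec["items"])

print(total_items(record))
```

Because the bag of items lives inside the record, no join against a separate items table is needed to total them.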

6 Pig Latin vs. SQL. SQL: little control over execution method; query optimization is hard (parallel environment, little or no statistics, lots of UDFs). Pig Latin: ease of programming; optimization opportunities; extensibility.

7 JOIN vs. COGROUP
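The slide gives only the title, so the following is a plain-Python sketch of the usual distinction rather than the presenters' own example: COGROUP keeps the matching tuples from each input in separate nested bags, while an inner JOIN behaves like a COGROUP followed by flattening the two bags into their cross product. The `visits` and `info` relations and the function names are hypothetical.

```python
from collections import defaultdict

def cogroup(left, right, key=0):
    """Group both inputs by key; each key maps to (left tuples, right tuples)."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[t[key]][0].append(t)
    for t in right:
        groups[t[key]][1].append(t)
    return dict(groups)

def join(left, right, key=0):
    """Inner join: flatten each cogroup into the cross product of its two bags."""
    out = []
    for l_bag, r_bag in cogroup(left, right, key).values():
        out.extend(l + r for l in l_bag for r in r_bag)
    return out

# Hypothetical sample relations: (url, user) and (url, category).
visits = [("a.com", "u1"), ("a.com", "u2"), ("b.com", "u1")]
info = [("a.com", "news"), ("c.com", "shop")]
```

With these inputs, `cogroup` still reports `b.com` and `c.com` (each with one empty bag), whereas `join` keeps only the two `a.com` combinations.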

8 Using Pig on cloud: Pig Latin programs run in a distributed fashion on a cluster. Programs are compiled into Map/Reduce jobs and executed using Hadoop. Pig Latin programs can also run in "local mode" without a cluster.

9 Data Flow: Load Visits -> Group by url -> Foreach url generate count; Load Url Info; Join on url -> Group by category -> Foreach category generate top10 urls.

10 Map-Reduce on Data Flow: Load Visits -> Group by url -> Foreach url generate count; Load Url Info; Join on url -> Group by category -> Foreach category generate top10 urls. Stages: Map 1, Reduce 1, Map 2, Reduce 2, Map 3, Reduce 3. Every group or join operation forms a map-reduce boundary.
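As a rough single-machine sketch of this data flow, the three grouping/joining steps can be written in plain Python, with comments marking where the slide says a map-reduce boundary would fall. The field layouts (`(user, url)` for visits, `(url, category)` for url info) and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict

def top_urls_by_category(visits, url_info, n=10):
    # Group by url and count visits (first group: one map-reduce boundary).
    counts = Counter(url for _user, url in visits)
    # Join on url with the url info (second boundary).
    category_of = dict(url_info)
    joined = [(url, category_of[url], c)
              for url, c in counts.items() if url in category_of]
    # Group by category, then keep the top n urls per group (third boundary).
    by_cat = defaultdict(list)
    for url, cat, c in joined:
        by_cat[cat].append((url, c))
    return {cat: sorted(rows, key=lambda r: (-r[1], r[0]))[:n]
            for cat, rows in by_cat.items()}

# Hypothetical sample data.
visits = [("amy", "cnn.com"), ("amy", "bbc.com"),
          ("fred", "cnn.com"), ("fred", "espn.com")]
url_info = [("cnn.com", "news"), ("bbc.com", "news"), ("espn.com", "sports")]
```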

11 Implementation (diagram): user; SQL; automatic rewrite + optimize; Pig; Hadoop Map-Reduce; cluster.

12 IP Filtering: Internet companies are swimming in data. Huge volumes of data must be analyzed to filter out bot IPs. A high-level language in a cloud environment would help filter out these IPs efficiently.

13 Log Used: two months' worth of all HTTP requests to the NASA Kennedy Space Center.

14 Data Flow: Load Logs -> Group by ip -> Foreach ip generate count; Load Log 2; Join on ip -> generate the top IPs -> ORDER BY count -> Filter ip based on threshold.

15 EXAMPLE
    A = LOAD 'input/*' USING PigStorage('\t') AS (ip:chararray);
    B = GROUP A BY ip;
    C = FOREACH B GENERATE FLATTEN(group), COUNT(A.ip) AS count;
    D = ORDER C BY count;
    E = FILTER D BY $1 > 500;
    STORE E INTO 'result';
Lines of Code: 6
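For readers without a Pig installation, the same logic can be approximated in plain Python. This is a sketch of what the script computes, not how Pig executes it: the function name and sample addresses are hypothetical, and the tie-break by IP in the sort is an added assumption to make the output deterministic.

```python
from collections import Counter

def filter_ips(ips, threshold=500):
    """Count requests per IP, order by count, keep IPs above the threshold."""
    counts = Counter(ips)                                   # GROUP ... BY ip / COUNT
    ordered = sorted(counts.items(),
                     key=lambda kv: (kv[1], kv[0]))         # ORDER ... BY count (ascending)
    return [(ip, n) for ip, n in ordered if n > threshold]  # FILTER ... BY $1 > threshold

# Hypothetical sample: three addresses with 3, 1, and 5 requests.
sample = ["10.0.0.1"] * 3 + ["10.0.0.2"] + ["10.0.0.3"] * 5
print(filter_ips(sample, threshold=2))
```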

16 IP Filtering - Pure Map Reduce (Map, Reduce): filter IPs from log files; compute occurrence of IPs; sort IPs based on count; compute cumulative frequency; filter IPs above threshold. Lines of Code: 130
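The slide does not show the 130-line Hadoop program, so the following is only a compact plain-Python sketch of the counting and cumulative-frequency steps in map/reduce style. It assumes the requesting host is the first whitespace-separated field of each log line, and all function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def map_line(line):
    """Map: emit (host, 1) for one log line; host assumed to be the first field."""
    fields = line.split()
    if fields:
        yield fields[0], 1

def reduce_counts(host, ones):
    """Reduce: sum the 1s emitted for a single host."""
    return host, sum(ones)

def run_job(lines):
    """Simulate map, shuffle/sort by key, and reduce on one machine."""
    pairs = sorted(kv for line in lines for kv in map_line(line))
    return [reduce_counts(host, (v for _, v in group))
            for host, group in groupby(pairs, key=itemgetter(0))]

def cumulative(counts):
    """Running total of request counts, most frequent host first."""
    total, out = 0, []
    for host, n in sorted(counts, key=lambda kv: (-kv[1], kv[0])):
        total += n
        out.append((host, n, total))
    return out

# Hypothetical log lines in a common-log-like layout.
lines = ['1.1.1.1 - - [t] "GET / HTTP/1.0" 200 1',
         '2.2.2.2 - - [t] "GET / HTTP/1.0" 200 1',
         '1.1.1.1 - - [t] "GET /a HTTP/1.0" 200 1']
```

A threshold filter like the one in the Pig example could then be applied to the output of `run_job`.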

17 PERFORMANCE ANALYSIS

23 DEMO

24 Pig Vs Hive

25 PROS AND CONS. Pros: allows UDFs; scales easily to large data; simple, user-understandable language. Cons: does not allow JDBC/ODBC; no server.

26 Time Line (Milestone: Schedule). Understand how Pig Latin works, read through the tutorial: 11/07/2011. Implement IP filter using Apache Pig and perform analysis to figure out best scenarios for specific optimizations: 11/14/2011. Implement IP filter using purely Hadoop and compare it to the Pig implementation: 11/28/2011. Conduct a case study on the pros and cons of high-level languages: 12/05/2011. Final Report: 12/12/2011.

27 References: A. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD 2008 International Conference on Management of Data (Vancouver, Canada, June 2008); https://cwiki.apache.org/confluence/display/PIG/Index; http://pig.apache.org/

28 Thank you

