Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.

Presentation transcript:

Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel

Objective
- Understand how Pig Latin works
- Implement an IP address filter using Apache Pig
- Implement a similar IP filter using plain Hadoop MapReduce
- Compare and analyze the two implementations
- Conduct a case study of the pros and cons of Pig relative to other high-level languages

What is Apache Pig?
- A platform for analyzing large data sets
- Supports merging data sets, filtering them, and applying functions to records or groups of records
- Allows you to create user-defined functions (UDFs)

Pig Infrastructure
Pig mainly consists of two layers:
- An infrastructure layer: a compiler that produces sequences of Map-Reduce programs
- A language layer, which currently consists of a textual language called Pig Latin

Nested Data Model
Pig Latin has a fully nestable data model with atomic values, tuples, bags (lists), and maps.
- More natural to programmers than flat tuples
- Avoids expensive joins
[Figure: nested example with the labels Computers, Desktops, Laptops, Netbooks]
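As a rough illustration of the nested model, here is a plain-Python sketch (not Pig code). It assumes the figure labels form a hierarchy in which Computers contains Desktops, Laptops, and Netbooks. That hierarchy, the `nest` helper, and the use of Python lists for bags are assumptions made for this example only.

```python
# Plain-Python stand-ins for Pig Latin's data types (illustration only):
#   atom -> str or int, tuple -> tuple, bag -> list of tuples, map -> dict

# Flat model: one row per (category, product). Rebuilding a category
# from rows like these needs a grouping or join step at query time.
flat_rows = [
    ("Computers", "Desktops"),
    ("Computers", "Laptops"),
    ("Computers", "Netbooks"),
]


def nest(rows):
    """Collapse flat (category, product) rows into (category, bag) tuples."""
    grouped = {}
    for category, product in rows:
        # Each bag element is itself a one-field tuple, as in Pig.
        grouped.setdefault(category, []).append((product,))
    return list(grouped.items())


# Nested model: a single tuple whose second field is a bag of tuples.
nested_rows = nest(flat_rows)
print(nested_rows)
```

With the nested form, the products of a category travel together in one record, so no join is needed to bring them back together.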

Pig Latin vs. SQL
SQL
- Little control over execution method
- Query optimization is hard
- Parallel environment
- Little or no statistics
- Lots of UDFs
Pig Latin
- Ease of programming
- Optimization opportunities
- Extensibility

JOIN vs. COGROUP
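The difference can be modeled in plain Python (this is a sketch of the semantics, not Pig code). COGROUP keeps the matching tuples from each input in separate bags, one pair of bags per key. An inner JOIN is equivalent to a COGROUP followed by flattening both bags, which yields the cross product per key and drops keys that are missing on either side. The function names and sample data below are hypothetical.

```python
from collections import defaultdict


def cogroup(left, right):
    """Group two lists of (key, value) tuples by key, one bag per input."""
    bags = defaultdict(lambda: ([], []))
    for key, value in left:
        bags[key][0].append((key, value))
    for key, value in right:
        bags[key][1].append((key, value))
    return [(key, l_bag, r_bag) for key, (l_bag, r_bag) in sorted(bags.items())]


def join(left, right):
    """Inner join expressed as cogroup plus a flatten of both bags."""
    out = []
    for _key, l_bag, r_bag in cogroup(left, right):
        for l_tuple in l_bag:
            for r_tuple in r_bag:  # an empty bag yields no output rows
                out.append(l_tuple + r_tuple)
    return out


# Hypothetical sample data.
visits = [("a.com", "alice"), ("a.com", "bob"), ("b.com", "carol")]
url_info = [("a.com", "news"), ("c.com", "sports")]

print(cogroup(visits, url_info))
print(join(visits, url_info))
```

COGROUP preserves keys that appear in only one input (with an empty bag on the other side), which is useful when the next step needs to treat each group as a whole rather than as flattened rows.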

Using Pig on cloud
- Pig Latin programs run in a distributed fashion on a cluster
- Programs are compiled into Map/Reduce jobs and executed using Hadoop
- Pig Latin programs can also run in "local mode" without a cluster

Data Flow
1. Load Visits
2. Group by url
3. Foreach url generate count
4. Load Url Info
5. Join on url
6. Group by category
7. Foreach category generate top10 urls

Map-Reduce on Data Flow
Steps: Load Visits, Group by url, Foreach url generate count, Load Url Info, Join on url, Group by category, Foreach category generate top10 urls
Stages: Map 1, Reduce 1, Map 2, Reduce 2, Map 3, Reduce 3
Every group or join operation forms a map-reduce boundary.
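A small Python sketch of the boundary rule, using the step names from the slide. It only counts the operations that force a shuffle. It does not model how Pig's compiler actually assigns steps to map and reduce phases, and the `shuffle_boundaries` helper is hypothetical.

```python
# Operations that force a shuffle, per the rule on the slide.
BOUNDARY_OPS = ("group", "join")

# Logical plan, in the order the steps appear on the slide.
plan = [
    "Load Visits",
    "Group by url",
    "Foreach url generate count",
    "Load Url Info",
    "Join on url",
    "Group by category",
    "Foreach category generate top10 urls",
]


def shuffle_boundaries(steps):
    """Return the steps that each start a new map-reduce job."""
    return [step for step in steps if step.lower().startswith(BOUNDARY_OPS)]


boundaries = shuffle_boundaries(plan)
print(len(boundaries), boundaries)
```

Three boundary operations correspond to the three map-reduce jobs (Map 1/Reduce 1 through Map 3/Reduce 3) on the slide.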

Implementation
[Figure: architecture diagram with the labels user, SQL, automatic rewrite + optimize, Pig, Hadoop Map-Reduce, cluster]

IP Filtering
- Internet companies are swimming in data
- Large volumes of data must be analyzed to filter out bot IPs
- A high-level language in a cloud environment would be useful for filtering out these IPs efficiently

Log Used
Two months' worth of all HTTP requests to the NASA Kennedy Space Center

Data Flow
1. Load Logs
2. Group by ip
3. Foreach ip generate count
4. Load Log 2
5. Join on ip
6. Generate the top IPs
7. ORDER BY count
8. Filter ip based on threshold

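EXAMPLE
A = LOAD 'input/*' USING PigStorage('\t') AS (ip:chararray);  -- one IP per record
B = GROUP A BY ip;                                            -- one group per distinct IP
C = FOREACH B GENERATE FLATTEN(group), COUNT(A.ip) AS count;  -- (ip, count)
D = ORDER C BY count;                                         -- ascending by count
E = FILTER D BY $1 > 500;                                     -- keep IPs with more than 500 hits
STORE E INTO 'result';                                        -- STORE takes no alias
Lines of Code: 6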
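To make the semantics of the script concrete, here is a plain-Python model of the same pipeline (not Pig code). The `filter_ips` name, the sample addresses, and the small threshold are hypothetical. Taking the first tab-separated field mirrors how PigStorage('\t') with a one-field schema exposes the first column.

```python
from collections import Counter


def filter_ips(lines, threshold=500):
    """Count requests per IP, order by count, keep IPs above the threshold."""
    # LOAD: take the first tab-separated field of each non-empty line.
    ips = [line.rstrip("\n").split("\t")[0] for line in lines if line.strip()]
    # GROUP ... BY ip, then COUNT per group.
    counts = Counter(ips)
    # ORDER ... BY count (ascending, the default order).
    ordered = sorted(counts.items(), key=lambda pair: pair[1])
    # FILTER ... BY $1 > threshold.
    return [(ip, n) for ip, n in ordered if n > threshold]


# Hypothetical input using documentation-range addresses.
sample = ["192.0.2.1"] * 3 + ["192.0.2.2"] + ["192.0.2.3"] * 5
print(filter_ips(sample, threshold=2))
```

The threshold of 2 is only there to keep the sample small; the script on the slide uses 500.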

IP Filtering - Pure Map Reduce
Map and Reduce steps:
- Filter IPs from log files
- Compute occurrences of each IP
- Sort IPs by count
- Compute cumulative frequency
- Filter IPs above the threshold
Lines of Code: 130
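The steps above can be sketched as a single-process Python simulation of the map, shuffle, and reduce phases. This is not the authors' 130-line Hadoop program. It assumes that the client address is the first whitespace-separated field of each log line, that sorting is descending by count, and that the threshold applies to the raw count rather than to the cumulative frequency. All function names and sample lines are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(log_lines):
    """Map: emit (ip, 1) for every non-empty log line."""
    for line in log_lines:
        fields = line.split()
        if fields:
            yield fields[0], 1


def reduce_phase(pairs):
    """Shuffle by key, then reduce: sum the counts for each IP."""
    shuffled = sorted(pairs, key=itemgetter(0))
    for ip, group in groupby(shuffled, key=itemgetter(0)):
        yield ip, sum(count for _ip, count in group)


def filter_by_threshold(log_lines, threshold):
    """Sort by count, attach cumulative frequency, keep IPs above threshold."""
    counts = sorted(reduce_phase(map_phase(log_lines)),
                    key=itemgetter(1), reverse=True)
    total = sum(n for _ip, n in counts)
    running = 0
    report = []
    for ip, n in counts:
        running += n
        report.append((ip, n, running / total))  # cumulative share of requests
    return [row for row in report if row[1] > threshold]


# Hypothetical log lines in a common-log-style layout.
logs = (
    ['192.0.2.1 - - [01/Jul/1995:00:00:01 -0400] "GET / HTTP/1.0" 200 100'] * 3
    + ['192.0.2.2 - - [01/Jul/1995:00:00:09 -0400] "GET / HTTP/1.0" 200 100']
)
print(filter_by_threshold(logs, threshold=2))
```

Even in this compressed form, the grouping, sorting, and filtering logic has to be written by hand, which is the contrast the slide draws with the six-line Pig script.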

PERFORMANCE ANALYSIS

DEMO

Pig Vs Hive

PROS AND CONS
PROS
- Allows UDFs
- Scales easily to large data
- Simple, easily understood language
CONS
- Does not support JDBC/ODBC
- No server

Time Line
Milestone | Schedule
Understand how Pig Latin works; read through the tutorial | 11/07/2011
Implement IP filter using Apache Pig and perform analysis to figure out best scenarios for specific optimizations | 11/14/2011
Implement IP filter using purely Hadoop and compare it to the Pig implementation | 11/28/2011
Conduct a case study on the pros and cons of high-level languages | 12/05/2011
Final Report | 12/12/2011


Thank you