Data Cleansing with Pig Latin. Neubot Tests Data Structure.

1 Data Cleansing with Pig Latin

2 Neubot Tests

3 Data Structure

4 Example

5 Hadoop Ecosystem

6 MapReduce
Source: http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig

7 Pig Latin
- High-level data flow language for exploring very large datasets.
- Provides an engine for executing data flows in parallel on Hadoop.
- Compiler that produces sequences of MapReduce programs.
- Structure is amenable to substantial parallelization.
- Operates on files in HDFS.
- Metadata not required, but used when available.
Key properties of Pig:
- Ease of programming: it is trivial to achieve parallel execution of simple, embarrassingly parallel data analysis tasks.
- Optimization opportunities: the system can optimize execution automatically, so the user can focus on semantics rather than efficiency.
- Extensibility: users can create their own functions to do special-purpose processing.
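To make the MapReduce target concrete, here is a minimal sketch in plain Python (not Pig or Hadoop code) of the map, shuffle, and reduce phases that a grouped count could be broken into. The record layout, function names, and sample data are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def map_phase(records):
    """Emit a (url, 1) pair for each (user, url) record."""
    for _user, url in records:
        yield url, 1

def shuffle_phase(pairs):
    """Collect emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

def count_clicks(records):
    """Run the three phases in sequence on an in-memory list."""
    return reduce_phase(shuffle_phase(map_phase(records)))

# Hypothetical sample data: (user, url) page views.
views = [("amy", "a.com"), ("bob", "a.com"), ("amy", "b.com")]
print(count_clicks(views))
```

On a real cluster the map and reduce functions run on many machines and the shuffle moves data between them; the sketch only shows the shape of the computation.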

8 Why Pig?

9 Equivalent Java MapReduce Code

10 Data flow:
- Load Users, then Filter by Age
- Load Pages
- Join on Name
- Group on url
- Count Clicks
- Order by Clicks
- Take Top 5
- Save results
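The data flow on this slide can be traced step by step in a short plain-Python sketch. This is not Pig Latin and not the authors' code: the age bounds are parameters because the slide does not state them, the record layouts (name, age) and (user, url) are assumptions, and the sample data is made up.

```python
from collections import Counter

def top_urls(users, pages, min_age, max_age, n=5):
    """Return the n most-clicked urls among users in the given age range.

    users: iterable of (name, age); pages: iterable of (user, url).
    Assumes user names are unique, so the join reduces to a membership test.
    """
    # Filter by Age
    names = {name for name, age in users if min_age <= age <= max_age}
    # Join on Name, Group on url, Count Clicks
    clicks = Counter(url for user, url in pages if user in names)
    # Order by Clicks (descending, ties broken by url), Take Top n
    return sorted(clicks.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Hypothetical sample data.
users = [("amy", 22), ("bob", 40), ("cat", 19)]
pages = [("amy", "a.com"), ("cat", "a.com"), ("bob", "b.com"), ("amy", "c.com")]

# "Save results" is replaced by printing in this sketch.
print(top_urls(users, pages, min_age=18, max_age=25))
```

Each comment names the slide step it stands for, which makes it easy to compare against the much longer Java MapReduce version the deck refers to.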

11 Pig Datatypes
Simple types: int, long, float, double, chararray, bytearray, boolean
Complex types:
Type | Description
tuple | An ordered set of fields.
bag | A collection of tuples.
map | A set of key-value pairs.
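As a rough analogy only, the three complex types can be mirrored with built-in Python containers. The mapping below is an assumption chosen for illustration (a Pig bag is an unordered collection that may contain duplicate tuples, so a list is closer than a set), and the sample values are made up.

```python
# Tuple: an ordered set of fields, accessed by position.
student = ("john", 18, 4.0)

# Bag: a collection of tuples; duplicates are allowed, so a list is used here.
enrolments = [("john", 18), ("mary", 19), ("john", 18)]

# Map: a set of key-value pairs.
contact = {"name": "john", "phone": "5551212"}

def field(tup, position):
    """Positional field access, comparable to referring to a field by index."""
    return tup[position]

def count_matching(bag, tup):
    """Count how many times a tuple occurs in a bag (duplicates are kept)."""
    return sum(1 for item in bag if item == tup)

print(field(student, 0))
print(count_matching(enrolments, ("john", 18)))
print(contact["phone"])
```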

12 Pig Simple Datatypes
Simple Type | Description | Example
int | Signed 32-bit integer | 10
long | Signed 64-bit integer | Data: 10L or 10l; Display: 10L
float | 32-bit floating point | Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
double | 64-bit floating point | Data: 10.5 or 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
chararray | Character array (string) in Unicode UTF-8 format | hello world
bytearray | Byte array (blob) |
boolean | boolean | true/false (case insensitive)

13 Pig Commands
Statement | Description
Load | Read data from the file system
Store | Write data to the file system
Dump | Write output to stdout
Foreach | Apply expression to each record and generate one or more records
Filter | Apply predicate to each record and remove records where false
Group / Cogroup | Collect records with the same key from one or more inputs
Join | Join two or more inputs based on a key
Order | Sort records based on a key
Distinct | Remove duplicate records
Union | Merge two datasets
Limit | Limit the number of records
Split | Split data into 2 or more sets, based on filter conditions
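Three of these statements are often useful for data cleansing, and their behaviour can be sketched in plain Python. This is an in-memory emulation under stated assumptions, not Pig's implementation: the function names and signatures are hypothetical, records are hashable tuples or values, and the grouping key defaults to the first field.

```python
from collections import defaultdict

def distinct(records):
    """Remove duplicate records, keeping the first occurrence of each."""
    seen = set()
    result = []
    for record in records:
        if record not in seen:
            seen.add(record)
            result.append(record)
    return result

def split(records, conditions):
    """Route records into named outputs based on filter conditions.

    conditions maps an output name to a predicate. A record is placed in
    every output whose predicate holds, so it may land in several or none.
    """
    outputs = {name: [] for name in conditions}
    for record in records:
        for name, predicate in conditions.items():
            if predicate(record):
                outputs[name].append(record)
    return outputs

def cogroup(left, right, key=lambda record: record[0]):
    """Collect records with the same key from two inputs, side by side."""
    groups = defaultdict(lambda: ([], []))
    for record in left:
        groups[key(record)][0].append(record)
    for record in right:
        groups[key(record)][1].append(record)
    return dict(groups)

# Hypothetical sample data.
print(distinct([(1, "a"), (1, "a"), (2, "b")]))
print(split([1, 5, 10], {"small": lambda x: x < 5, "big": lambda x: x > 5}))
print(cogroup([("a", 1)], [("a", 2), ("b", 3)]))
```

Note that the value 5 in the split example matches neither condition and is dropped from both outputs, which is why split conditions need to be checked for coverage when cleansing data.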

14 Pig Diagnostic Operators
Statement | Description
Describe | Returns the schema of the relation
Dump | Dumps the results to the screen
Explain | Displays execution plans
Illustrate | Displays a step-by-step execution of a sequence of statements

15 Architecture of Pig
- Grunt (interactive shell)
- PigServer (Java API)
- PigContext
- Parser (PigLatin → LogicalPlan)
- Optimizer (LogicalPlan → LogicalPlan)
- Compiler (LogicalPlan → PhysicalPlan → MapReducePlan)
- ExecutionEngine
- Hadoop

16 Pig Latin vs SQL

17 RapidMiner

