Pig Latin - A Not-So-Foreign Language for Data Processing

1 Pig Latin - A Not-So-Foreign Language for Data Processing
YING YU Sep 19th 2016

2 Background: what's the problem?
As data sets grow extremely large, how can we make ad-hoc data analysis tasks more efficient? Looking for trends → product improvement → advertisement

3 Previous Solutions: Analysis of Large Data Sets
Parallel database products (Teradata, Oracle RAC, etc.). Problems: expensive at web scale; SQL is unnatural and restrictive to write.
MapReduce model. Problems: code is difficult to reuse and maintain; the map and reduce functions are opaque, so the system cannot optimize or check the user's code.

4 Yahoo!'s New Solution: Pig Latin
Combines the best of SQL and MapReduce: a program is a sequence of steps, each carrying out a single, fairly high-level data transformation. It blends high-level declarative querying with low-level procedural programming.

5 Why Pig Latin?
1. Quick start and interoperability: no need to import data before querying; queries operate directly on external files, saving time and space, and the user chooses the output.
2. Support for a nested data model: flexible, more natural to think in, and a good fit for an algebraic language. E.g., to store positional information about documents, the nested data model uses Map<documentId, Set<positions>>, whereas a database needs two flat tables: term_info: (termId, termString, ...) and position_info: (termId, documentId, position).

6 Why Pig Latin? 3. Support for UDFs (extensibility)
For specialized data processing; top10() below is a UDF. In SQL, each clause can only use its corresponding class of functions.
4. A novel debugging environment: detect errors early and pinpoint the erroneous step.
groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);
SELECT url FROM table WHERE category = 'a';
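The GROUP-then-UDF pipeline on this slide can be sketched in plain Python; top10 here is a hypothetical stand-in for the user-defined function, and the sample urls tuples (url, category, rank) are made up for illustration:

```python
from collections import defaultdict

# Hypothetical sample data: (url, category, rank)
urls = [("a.com", "sports", 90), ("b.com", "sports", 70), ("c.com", "news", 80)]

def top10(bag):
    # Stand-in UDF: keep the ten highest-ranked tuples in the bag.
    return sorted(bag, key=lambda t: t[2], reverse=True)[:10]

# GROUP urls BY category: build one bag of tuples per category.
groups = defaultdict(list)
for t in urls:
    groups[t[1]].append(t)

# FOREACH groups GENERATE category, top10(urls): apply the UDF to each bag.
output = [(cat, top10(bag)) for cat, bag in groups.items()]
```

Unlike SQL, the UDF is applied uniformly wherever a bag is available, rather than being tied to a specific clause.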

7 Pig Latin: data model — type definitions and fetching data
atom: a string or a number, e.g. 'a'
tuple: a sequence of fields, e.g. ('a','b'); fetch with tuple.id, tuple.(id0, id1), or $0
bag: a collection of tuples, e.g. {('a','b'), ('a',('c','d'))}; fetch with bag.id, bag.(id0, id1), or $0
map: a collection of data items, e.g. ['fan of' → {('lakers'),('iPod')}, 'age' → 20]; fetch with $0#'key' or field_name#'key'
Fields are addressed by position or by name.
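The four types above can be mirrored in plain Python structures (a hypothetical illustration of the model, not Pig's internal representation): atoms as strings/numbers, tuples as tuples, bags as lists of tuples, maps as dicts.

```python
# One record in the nested data model, mirrored in Python:
record = (
    "alice",                                # atom: a string
    [("lakers",), ("iPod",)],               # bag: a collection of tuples
    {"fan of": [("lakers",), ("iPod",)],    # map: key -> data item
     "age": 20},
)

# Access by position ($0, $1, $2) is just indexing:
name = record[0]                            # $0
fan_of = record[2]["fan of"]                # $2#'fan of'
age = record[2]["age"]                      # $2#'age'
```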

8 Commands
1. LOAD: specifies the input data; myLoad() converts the file into tuples.
2. FOREACH: applies independent processing to every tuple; FLATTEN is used to eliminate the nesting.
queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));
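The FLATTEN step above can be sketched in Python: expand_query is a hypothetical UDF that returns a bag of expansions, and FLATTEN crosses that bag with the other generated fields so each output row is un-nested (the sample query tuples are invented for illustration):

```python
# Hypothetical sample data: (userId, queryString, timestamp)
queries = [("u1", "lakers", 1), ("u2", "iPod", 2)]

def expand_query(q):
    # Stand-in UDF: returns a bag (list of 1-field tuples) of expansions.
    return [(q,), (q + " news",)]

# FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString)):
# the inner loop eliminates the nesting, emitting one flat row per expansion.
expanded = [
    (user_id, expansion)
    for (user_id, q, _ts) in queries
    for (expansion,) in expand_query(q)
]
```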

9 Commands (cont.)
3. FILTER: to get the target subset of the data (cf. SQL: SELECT queryResult FROM table WHERE userId != 'a').
4. COGROUP: to get related data together; it sits between GROUP, JOIN, and raw map-reduce.
real_queries = FILTER queries BY NOT isBot(userId);
grouped_data = COGROUP results BY queryString, revenue BY queryString;
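COGROUP's semantics can be sketched in Python: unlike JOIN, it does not cross the inputs but keeps, for each key, one separate bag per input (the results/revenue sample tuples are invented for illustration):

```python
from collections import defaultdict

# Hypothetical sample data, keyed by queryString in field 0.
results = [("lakers", "nba.com"), ("iPod", "apple.com")]
revenue = [("lakers", 0.5), ("lakers", 0.3)]

def cogroup(a, b):
    # For each key, collect one bag of tuples from each input.
    by_a, by_b = defaultdict(list), defaultdict(list)
    for t in a:
        by_a[t[0]].append(t)
    for t in b:
        by_b[t[0]].append(t)
    keys = set(by_a) | set(by_b)
    return {k: (by_a[k], by_b[k]) for k in keys}

grouped_data = cogroup(results, revenue)
```

A JOIN would instead emit the cross-product of the two bags per key; COGROUP leaves that choice to later steps.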

10 Commands (cont.)
5. Other SQL-like commands: UNION, CROSS, ORDER, and DISTINCT.
6. STORE: stores nested data, i.e. asks for a result to be materialized to a file.
7. Nested commands: FILTER, ORDER, and DISTINCT are allowed to be nested within FOREACH.
STORE query_revenues INTO 'myoutput' USING myStore();

11 Implementation
Execution platform: Hadoop. Pig first builds a logical plan; construction is iterative and independent of the execution platform. Try to avoid materializing large nested bags.
Q: What to do when an operation after the COGROUP is not an algebraic UDF?
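On the question above: in the paper's terms, an algebraic UDF can be split into initial/intermediate/final stages, so partial aggregates can be combined early (e.g. in a map-side combiner); a non-algebraic UDF must see the whole bag at once, so the bag has to be materialized. A minimal Python sketch of an algebraic AVERAGE (function names are illustrative, not Pig's API):

```python
def avg_initial(bag):
    # Runs on each fragment of the bag: emit a partial (sum, count).
    return (sum(bag), len(bag))

def avg_intermediate(partials):
    # Combines partial (sum, count) pairs from different fragments.
    return (sum(s for s, _ in partials), sum(c for _, c in partials))

def avg_final(partial):
    # Produces the final value from the combined partial state.
    s, c = partial
    return s / c

# Computing the average incrementally over two fragments of the data
# gives the same answer as computing it over the whole bag at once.
parts = [avg_initial([1, 2]), avg_initial([3, 4, 5])]
result = avg_final(avg_intermediate(parts))
```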

12 E.g.: Logical Plan
Construction of the logical plan for grouped_data: it contains a COGROUP command plus the logical plans for its two input bags, results and revenue. The interpreter parses the commands; compilation happens when the STORE command is invoked on a bag: logical plan → physical plan.
grouped_data = COGROUP results BY queryString, revenue BY queryString;

13 Map-Reduce Plan
Pig converts each COGROUP command in the logical plan into a distinct map-reduce job. Q: What if there is more than one input data set in a COGROUP command? The map function appends an extra field to each tuple that identifies which input the tuple came from; map and reduce instances run in parallel. (Example from Programming Pig.)
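The origin-tagging trick on this slide can be sketched in Python: the map phase tags each tuple with the index of its input, the shuffle groups by key, and the reduce phase uses the tag to rebuild the separate bags (sample data and function names are invented for illustration):

```python
from collections import defaultdict

# Hypothetical inputs to a two-input COGROUP, keyed by field 0.
results = [("lakers", "nba.com")]
revenue = [("lakers", 0.5)]

# Map phase: emit (key, (origin, tuple)); the extra origin field
# records which input each tuple came from.
mapped = [(t[0], (0, t)) for t in results] + [(t[0], (1, t)) for t in revenue]

# Shuffle: group the mapped records by key.
reduce_input = defaultdict(list)
for k, v in mapped:
    reduce_input[k].append(v)

def reduce_fn(key, values):
    # Reduce phase: split tuples back into one bag per input.
    bags = ([], [])
    for origin, t in values:
        bags[origin].append(t)
    return (key, bags[0], bags[1])

out = [reduce_fn(k, vs) for k, vs in sorted(reduce_input.items())]
```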

14 Debugging Environment: Pig Pen
Generates a sandbox data set (goals: realism, conciseness, completeness) to help the user understand the schema. Q: How to generate a data set that fulfills all three objectives?

15 Use Scenarios
Rollup aggregates: data sets are too big, and arrive too continuously, to be accumulated first; successive aggregations and other commands are run directly on the files.
Temporal analysis: to study how the data changes across different periods (COGROUP, UDFs).
Session analysis: analysis of user activities during a session; the nested data model represents and manipulates sessions (ORDER BY).

16 Future Optimizations
Database-style optimization: safe optimizations that do not affect program semantics. UI: support for other languages; a unified environment.

17 Discussion
Objectives of Pig Latin: what are the main differences between Pig and an RDBMS?
Q: Does Pig have transaction handling and indexes? A: No.
Q: How will Pig Latin support other scripting languages such as Perl or Python?

