Storage and Analysis of Tera-scale Data: 2 of 2
415 Database Class, 11/24/09
delip@jhu.edu

Previously …
- (Traditional) databases are not Swiss-Army knives
- Large data problems require radically different solutions
- Exploit the power of parallel I/O and computation
- MapReduce as a framework for building reliable distributed data processing applications
- Storing large data requires redesign from the ground up, i.e. starting at the filesystem (HDFS)

Previously …
- HDFS: a reliable open source distributed file system
- HBase: a sorted multi-dimensional map for record-oriented data
  – Not relational
  – No query language other than map semantics (Get and Put)

MapReduce is great but … Got to write all this for a WordCount!!!
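For a sense of how much has to be spelled out by hand, here is a minimal single-process sketch of WordCount's three phases (map, shuffle, reduce). It is written in Python purely for illustration; the function names are hypothetical and none of this is Hadoop's API. A real Hadoop job additionally needs job configuration, input/output formats, and packaging.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["a b a", "b c"]))  # prints {'a': 2, 'b': 2, 'c': 1}
```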

MapReduce
- Development cycles too long
  – Writing code
  – Packaging code
- JOINs on large data too hard to implement in MapReduce
- Today’s class: Keeping it Simple
  – Can we hide MapReduce from users behind an abstraction?

Pig
- Started in Fall 2007 at Yahoo!
- Simplify MapReduce by capturing common data processing patterns
  – Results in improved productivity
  – Lowers barrier to entry for large data processing
- Today: Runs 40% of Yahoo!’s large data jobs
- Who else: Twitter, LinkedIn, AOL, …
- Similar efforts elsewhere: Sawzall (Google), Hive (Facebook)

Pig = Query Language + Interpreter
- Language: Pig Latin
  – A data flow language: LOAD, STORE, FILTER, ORDER, GROUP, JOIN
- Interpreter: Grunt
  – An execution environment to convert Pig Latin to MapReduce
- Two modes
  – Local: JVM
  – Distributed: via Hadoop

Pig Latin Example (from the Pittsburgh Hadoop Users Group)

Equivalent MapReduce code

Pig Latin from an Example
- Find users who visit “good” pages
(Example courtesy: Yahoo! Research)

Conceptual Dataflow

Pig Latin script
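As a rough illustration of what such a dataflow computes, the sketch below emulates a JOIN, GROUP, FOREACH and FILTER pipeline in plain Python. Every specific here is an assumption made for the example, not taken from the slides: the input shapes (`visits` as user/url pairs, `pages` as url/pagerank pairs), the definition of a "good" user as one whose visited pages have an average pagerank above 0.5, and the function name `good_users`.

```python
from collections import defaultdict

def good_users(visits, pages, threshold=0.5):
    """visits: iterable of (user, url); pages: iterable of (url, pagerank).

    Assumes each url appears at most once in pages.
    """
    rank_of = dict(pages)                  # lookup table for the JOIN on url
    ranks_by_user = defaultdict(list)      # GROUP the joined rows BY user
    for user, url in visits:
        if url in rank_of:                 # inner join: unmatched visits are dropped
            ranks_by_user[user].append(rank_of[url])
    # FOREACH group GENERATE user, AVG(pagerank)
    averages = {u: sum(r) / len(r) for u, r in ranks_by_user.items()}
    # FILTER BY average pagerank > threshold
    return sorted(u for u, avg in averages.items() if avg > threshold)

visits = [("amy", "a.com"), ("amy", "b.com"), ("fred", "c.com"), ("fred", "x.com")]
pages = [("a.com", 0.9), ("b.com", 0.7), ("c.com", 0.2)]
print(good_users(visits, pages))  # prints ['amy']
```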

Pig Latin: The Language
- Structure
  – Collection of STATEMENTS
  – Statement has an OPERATOR and ends in ‘;’

Summary of Pig Latin Operators
Category                  Operators
Loading and Storing       LOAD, STORE, DUMP
Filtering                 FILTER, DISTINCT, FOREACH … GENERATE, STREAM
Grouping and Joining      JOIN, COGROUP, CROSS
Sorting                   ORDER, LIMIT
Combining and Splitting   UNION, SPLIT

LOAD/STORE and Schemas
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> records = LOAD 'input/sample.txt';
grunt> STORE records INTO 'output/sample.out';

FILTER
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> bad_records = FILTER records BY quality < 0;
grunt> bad_years = FOREACH bad_records GENERATE year;
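In plain Python terms, FILTER keeps the rows that satisfy a predicate and FOREACH … GENERATE projects fields from each surviving row. The sketch below mirrors the two statements; the `Record` type, the function name and the sample rows are made up for illustration and are not the contents of `input/sample.txt`.

```python
from collections import namedtuple

# Hypothetical row type matching the (year, temperature, quality) schema.
Record = namedtuple("Record", ["year", "temperature", "quality"])

def bad_years(records):
    bad_records = [r for r in records if r.quality < 0]  # FILTER records BY quality < 0
    return [r.year for r in bad_records]                 # FOREACH bad_records GENERATE year

records = [Record(1950, 0, 1), Record(1950, 22, -1), Record(1949, 111, -2)]
print(bad_years(records))  # prints [1950, 1949]
```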

STREAM
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> projected = FOREACH records GENERATE $0, $2;
grunt> projected = STREAM records THROUGH `cut -f1,3`;
- Pig positions ($0, $2) are 0-based; cut field numbers are 1-based

JOIN
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> sales = LOAD 'input/sales.txt'
>>     AS (year:int, profit:float);
grunt> combined = JOIN records BY year, sales BY year;
grunt> profit_year = FOREACH combined GENERATE sales::profit, records::year;
- Both inputs have a year field, so it must be disambiguated with the relation::field form after the JOIN
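Conceptually, an inner JOIN emits one output row per matching pair of input rows, with the fields of both sides concatenated. The sketch below is a simple in-memory hash join in Python, offered only to show those semantics; it says nothing about how Pig actually executes joins on a cluster. The helper name `inner_join` and the sample rows are hypothetical.

```python
from collections import defaultdict

def inner_join(left, right, left_key, right_key):
    """Join two lists of tuples on the fields at positions left_key and right_key."""
    index = defaultdict(list)
    for row in right:                       # build a lookup table on the right input
        index[row[right_key]].append(row)
    # Probe with each left row; concatenate the fields of every matching pair.
    return [l + r for l in left for r in index.get(l[left_key], [])]

records = [(1950, 0, 1), (1950, 22, 1), (1949, 111, 1)]   # (year, temperature, quality)
sales = [(1950, 10.5), (1951, 3.0)]                       # (year, profit)

combined = inner_join(records, sales, 0, 0)
profit_year = [(row[4], row[0]) for row in combined]      # GENERATE profit, year
print(profit_year)  # prints [(10.5, 1950), (10.5, 1950)]
```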

GROUP
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> combined = GROUP records BY quality;
grunt> -- the grouping key can also be an expression
grunt> combined = GROUP records BY (quality < 0 ? 'bad' : 'good');
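GROUP produces one output row per distinct key, pairing the key with the bag of all input rows that share it. A minimal Python sketch of that behaviour follows; the helper name `group_by` and the sample rows are assumptions for illustration.

```python
from collections import defaultdict

def group_by(rows, key):
    """Return {key value: list of rows with that key}, like GROUP rows BY key."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)

records = [(1950, 0, 1), (1950, 22, 4), (1949, 111, 1)]   # (year, temperature, quality)

by_quality = group_by(records, lambda r: r[2])            # key is a plain field
by_sign = group_by(records, lambda r: "bad" if r[2] < 0 else "good")  # key is an expression
print(by_quality)  # prints {1: [(1950, 0, 1), (1949, 111, 1)], 4: [(1950, 22, 4)]}
```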

ORDER
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> combined = ORDER records BY year, quality DESC;
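The ORDER statement sorts by year ascending and, within the same year, by quality descending (DESC applies only to the field it follows). In Python the same ordering can be sketched with a composite sort key; the function name and sample rows are made up for illustration.

```python
def order_by_year_quality_desc(records):
    """Sort (year, temperature, quality) tuples by year ASC, then quality DESC."""
    # Negating the numeric quality field reverses its order within each year.
    return sorted(records, key=lambda r: (r[0], -r[2]))

records = [(1950, 0, 1), (1949, 111, 5), (1950, 22, 4), (1949, 78, 1)]
print(order_by_year_quality_desc(records))
# prints [(1949, 111, 5), (1949, 78, 1), (1950, 22, 4), (1950, 0, 1)]
```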

Parallelism
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> combined = GROUP records BY quality PARALLEL 50;
- Can use the PARALLEL keyword on any operator that starts a reduce phase (GROUP, COGROUP, JOIN, CROSS, DISTINCT, ORDER)
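PARALLEL sets how many reduce tasks the grouped data is spread across. One way to picture this is hash partitioning: each row goes to the reducer selected by hashing its key, so all rows with the same key land on the same reducer. The Python sketch below is only an analogy in the spirit of hash partitioning; the function name `partition` is hypothetical, and Python's built-in `hash` stands in for whatever hash the framework uses.

```python
def partition(rows, key, num_reducers):
    """Assign each row to one of num_reducers buckets by hashing its key."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[hash(key(row)) % num_reducers].append(row)
    return partitions

records = [(1950, 0, 1), (1950, 22, 4), (1949, 111, 1), (1949, 78, 5)]
parts = partition(records, lambda r: r[2], 3)   # like GROUP ... BY quality PARALLEL 3

# Rows that share a quality value always end up in the same partition.
for i, part in enumerate(parts):
    print(i, part)
```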

User Defined Functions
- Unlike standard SQL, can invoke custom-defined functions in a query
  – Proprietary solutions like PL/SQL allow that
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> REGISTER mypackage.jar;
grunt> DEFINE MyFunc mypackage.MyFuncImpl();
grunt> combined = GROUP records BY MyFunc(quality);

PIG LATIN Review
Category                  Operators
Loading and Storing       LOAD, STORE, DUMP
Filtering                 FILTER, DISTINCT, FOREACH … GENERATE, STREAM
Grouping and Joining      JOIN, COGROUP, CROSS
Sorting                   ORDER, LIMIT
Combining and Splitting   UNION, SPLIT

Revisiting WordCount
grunt> sentences = LOAD 'input/*.txt'
>>     USING TextLoader() AS (sentence:chararray);
grunt> words = FOREACH sentences GENERATE FLATTEN(TOKENIZE(sentence)) AS word;
grunt> word_kinds = GROUP words BY word;
grunt> word_count = FOREACH word_kinds
>>     GENERATE group, COUNT(words);
grunt> STORE word_count INTO 'output/wordcount';

No more of this …

Related Project: Hive
- Started at Facebook, now open source
- Like Pig, but supports a SQL-like query language (HiveQL)
- Trend: Move towards in-database MapReduce
  – Allows existing DB applications to scale up
  – Makes MapReduce capabilities easily accessible
  – Business opportunity: www.vertica.com

Summary (this and last class)
- MapReduce as a radically different solution to large data problems
- Exploit the power of parallel I/O and computation
- Need to think from the “ground up”
  – Filesystem: HDFS
  – Table store: HBase
- Basic MapReduce too complicated for DB end users

Summary (this and last class)
- Efforts to simplify MapReduce-based data processing
- Pig from Yahoo!
- Pig Latin: a not-so-SQL-like language
  – A data flow language: LOAD, STORE, FILTER, ORDER, GROUP, JOIN
- Facebook Hive supports a direct SQL interface
- Emerging trend: Fusion of MapReduce and DB technologies

Happy Thanksgiving!
