Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09


1 Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu

2 Previously …
(Traditional) databases are not Swiss Army knives
Large data problems require radically different solutions
Exploit the power of parallel I/O and computation
MapReduce as a framework for building reliable distributed data processing applications
Storing large data requires a redesign from the ground up, starting with the filesystem (HDFS)

3 Previously …
HDFS: a reliable open source distributed file system
HBase: a sorted multi-dimensional map for record-oriented data
  – Not relational
  – No query language other than map semantics (Get and Put)

4 MapReduce is great but … Got to write all this for a WordCount!!!
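The code shown on this slide did not survive in the transcript. As a rough stand-in, here is a minimal single-process Python sketch of the three phases a MapReduce WordCount goes through. It is not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real job would also need job configuration, input/output formats, and packaging.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: collect all values emitted for the same key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input lines, used only to exercise the sketch.
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

Even in this toy form the programmer writes the mapper, the reducer, and the glue between them by hand, which is the overhead the rest of the class tries to remove.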

5 MapReduce
Development cycles too long
  – Writing code
  – Packaging code
JOINs on large data are too hard to implement in MapReduce
Today’s class: Keeping it Simple
  – Can we abstract MapReduce away from users?

6 Pig
Started in Fall 2007 at Yahoo!
Simplifies MapReduce by capturing common data processing patterns
  – Results in improved productivity
  – Lowers the barrier to entry for large data processing
Today: runs 40% of Yahoo!’s large data jobs
Who else: Twitter, LinkedIn, AOL, …
Similar efforts elsewhere: Sawzall (Google), Hive (Facebook)

7 Pig = Query Language + Interpreter
Language: Pig Latin
  – A data flow language: LOAD, STORE, FILTER, ORDER, GROUP, JOIN
Interpreter: Grunt
  – An execution environment that converts Pig Latin to MapReduce
Two modes
  – Local: JVM
  – Distributed: via Hadoop

8 Pig Latin Example from Pittsburgh Hadoop Users Group

9 Equivalent MapReduce code

10 Pig Latin from an Example Find users who visit “good” pages (Example courtesy: Yahoo! Research)

11 Conceptual Dataflow

12 Pig Latin script

13 Pig Latin: The Language
Structure
  – A collection of STATEMENTS
  – Each statement has an OPERATOR and ends in ';'

14 Summary of Pig Latin Operators
Category                   Operators
Loading and Storing        LOAD, STORE, DUMP
Filtering                  FILTER, DISTINCT, FOREACH … GENERATE, STREAM
Grouping and Joining       JOIN, COGROUP, CROSS
Sorting                    ORDER, LIMIT
Combining and Splitting    UNION, SPLIT

15 LOAD/STORE and Schemas
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> records = LOAD 'input/sample.txt';
grunt> STORE records INTO 'output/sample.out';

16 FILTER
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> bad_records = FILTER records BY quality < 0;
grunt> bad_years = FOREACH bad_records GENERATE year;
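To make the semantics concrete, the following Python sketch mirrors the two statements on plain in-memory tuples: FILTER keeps the rows that satisfy a predicate, and FOREACH … GENERATE projects each remaining row onto the listed fields. The sample rows and the `Record` type are invented for illustration; Pig would run the same logic as distributed jobs rather than list comprehensions.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """One row of the (year, temperature, quality) schema from the slide."""
    year: int
    temperature: int
    quality: int


# Hypothetical sample data, not taken from the lecture.
records = [
    Record(1950, 0, 1),
    Record(1950, 22, -1),
    Record(1949, 111, 1),
    Record(1949, 78, -2),
]

# bad_records = FILTER records BY quality < 0;
bad_records = [r for r in records if r.quality < 0]

# bad_years = FOREACH bad_records GENERATE year;
# Each output row is still a tuple, here with a single field.
bad_years = [(r.year,) for r in bad_records]

if __name__ == "__main__":
    print(bad_years)
```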

17 STREAM
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> projected = FOREACH records GENERATE $0, $2;
grunt> projected = STREAM records THROUGH `cut -f 1,3`;

18 JOIN
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> sales = LOAD 'input/sales.txt'
>>     AS (year:int, profit:float);
grunt> combined = JOIN records BY year, sales BY year;
grunt> profit_year = FOREACH combined GENERATE sales::profit, records::year;
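The JOIN above is an inner equi-join: each output row concatenates a row from the first relation with a matching row from the second, and rows with no match are dropped. A small Python sketch of that behavior, using a hash index on the join key, is given below. The helper `inner_join` and the sample rows are assumptions for illustration and say nothing about how Pig actually plans the join.

```python
from collections import defaultdict

# Hypothetical sample data following the schemas on the slide.
records = [(1949, 111, 1), (1950, 22, 1), (1951, 30, 1)]  # (year, temperature, quality)
sales = [(1949, 10.5), (1950, -3.0), (1950, 4.25)]        # (year, profit)


def inner_join(left, right, left_key=0, right_key=0):
    """Inner equi-join of two lists of tuples on the given field positions."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    # Concatenate every left row with every right row that shares its key.
    return [l + r for l in left for r in index.get(l[left_key], [])]


# combined = JOIN records BY year, sales BY year;
combined = inner_join(records, sales)

# The joined row holds both year fields (positions 0 and 3), which is why
# a Pig script has to say which one it means.
profit_year = [(row[4], row[0]) for row in combined]

if __name__ == "__main__":
    print(profit_year)
```

The 1951 record has no matching sales row, so it does not appear in the result, while 1950 appears twice because it matches two sales rows.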

19 GROUP
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> combined = GROUP records BY quality;
grunt> combined = GROUP records BY (quality < 0 ? 'bad' : 'good');
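GROUP does not aggregate by itself: its output has one row per distinct key, made of the key (named `group` in Pig) and a bag holding every input row with that key. The Python sketch below imitates that shape. The helper `group_by` and the sample rows are invented, and the result is sorted by key only to make the output deterministic; Pig makes no ordering promise for the groups.

```python
from collections import defaultdict

# Hypothetical (year, temperature, quality) rows.
records = [(1949, 111, 1), (1949, 78, 0), (1950, 22, 1), (1950, 0, 0)]


def group_by(rows, key):
    """Return a list of (group_key, bag_of_rows) pairs, sorted by key."""
    bags = defaultdict(list)
    for row in rows:
        bags[key(row)].append(row)
    return sorted(bags.items())


# combined = GROUP records BY quality;
combined = group_by(records, key=lambda r: r[2])

# The key may also be computed by an expression instead of read from a field.
by_sign = group_by(records, key=lambda r: "bad" if r[2] < 1 else "good")

if __name__ == "__main__":
    for group, bag in combined:
        print(group, bag)
```

An aggregate such as COUNT or AVG is applied afterwards, in a FOREACH over the grouped relation, with the bag as its argument.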

20 ORDER
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> combined = ORDER records BY year, quality DESC;
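In the ORDER statement, DESC applies only to the field it follows, so the rows are sorted by year ascending and, within a year, by quality descending. A short Python sketch of the same ordering on made-up rows:

```python
# Hypothetical (year, temperature, quality) rows.
records = [(1950, 22, 1), (1949, 111, 4), (1950, 0, 5), (1949, 78, 9)]

# combined = ORDER records BY year, quality DESC;
# Negating the numeric quality field turns an ascending sort into a
# descending one for that field only.
combined = sorted(records, key=lambda r: (r[0], -r[2]))

if __name__ == "__main__":
    for row in combined:
        print(row)
```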

21 Parallelism
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> combined = GROUP records BY quality PARALLEL 50;
The PARALLEL keyword can be used on any operator that starts a reduce phase (e.g. GROUP, COGROUP, JOIN, CROSS, DISTINCT, ORDER)
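PARALLEL 50 asks for 50 reduce tasks. One way to picture what that means is hash partitioning: every row is sent to the reducer chosen by hashing its group key, so all rows with the same key end up at the same reducer. The Python sketch below illustrates the idea only. The helpers `partition` and `assign_to_reducers` are hypothetical, and the CRC32 hash is a stand-in chosen because it is deterministic across runs, not the hash Hadoop uses.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers


def assign_to_reducers(rows, key, num_reducers):
    """Distribute rows into num_reducers buckets by the hash of their key."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[partition(key(row), num_reducers)].append(row)
    return buckets


if __name__ == "__main__":
    # Hypothetical (year, temperature, quality) rows.
    records = [(1949, 111, 1), (1949, 78, 0), (1950, 22, 1), (1950, 0, 0)]
    buckets = assign_to_reducers(records, key=lambda r: r[2], num_reducers=50)
    for index, bucket in enumerate(buckets):
        if bucket:
            print(index, bucket)
```

With only two distinct quality values in the sample, at most two of the 50 buckets receive any rows, which shows that the useful degree of parallelism is bounded by the number of distinct keys.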

22 User Defined Functions
Unlike standard SQL, Pig Latin can invoke custom functions in a query
  – Proprietary SQL extensions such as PL/SQL allow that
grunt> records = LOAD 'input/sample.txt'
>>     AS (year:int, temperature:int, quality:int);
grunt> REGISTER mypackage.jar;
grunt> DEFINE MyFunc mypackage.MyFuncImpl.myFunc();
grunt> combined = GROUP records BY MyFunc(quality);

23 Pig Latin Review
Category                   Operators
Loading and Storing        LOAD, STORE, DUMP
Filtering                  FILTER, DISTINCT, FOREACH … GENERATE, STREAM
Grouping and Joining       JOIN, COGROUP, CROSS
Sorting                    ORDER, LIMIT
Combining and Splitting    UNION, SPLIT

24 Revisiting WordCount
grunt> sentences = LOAD 'input/*.txt'
>>     USING TextLoader() AS (sentence:chararray);
grunt> words = FOREACH sentences GENERATE FLATTEN(TOKENIZE(sentence)) AS word;
grunt> word_kinds = GROUP words BY word;
grunt> word_count = FOREACH word_kinds
>>     GENERATE group, COUNT(words);
grunt> STORE word_count INTO 'output/wordcount';
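Reading the script as a data flow, each statement turns one relation into the next. The Python sketch below follows the same steps one for one on an in-memory list. The function name `word_count_dataflow` and the sample sentences are invented, and `str.split` is only an approximation of TOKENIZE, which also splits on a few punctuation characters.

```python
def word_count_dataflow(sentences):
    """Mirror the Pig Latin WordCount script step by step on a list of strings."""
    # words = FOREACH sentences GENERATE FLATTEN(TOKENIZE(sentence)) AS word;
    # TOKENIZE yields a bag of words per sentence; FLATTEN unnests the bags.
    words = [word for sentence in sentences for word in sentence.split()]

    # word_kinds = GROUP words BY word;
    word_kinds = {}
    for word in words:
        word_kinds.setdefault(word, []).append(word)

    # word_count = FOREACH word_kinds GENERATE group, COUNT(words);
    return {group: len(bag) for group, bag in word_kinds.items()}


if __name__ == "__main__":
    # Hypothetical input, standing in for the lines of input/*.txt.
    print(word_count_dataflow(["the quick brown fox", "the lazy dog", "the end"]))
```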

25 No more this …

26 Related Project: Hive
Started at Facebook, now open source
Like Pig, but supports SQL
Trend: move towards in-database MapReduce
Allows existing DB applications to scale up
Makes MapReduce capabilities easily accessible
Business opportunity: www.vertica.com

27 Summary (this and last class)
MapReduce as a radically different solution to large data problems
Exploit the power of parallel I/O and computation
Need to think from the “ground up”
  – Filesystem: HDFS
  – Table store: HBase
Basic MapReduce is too complicated for DB end users

28 Summary (this and last class)
Efforts to simplify MapReduce-based data processing
Pig from Yahoo!
Pig Latin: a not-so-SQL-like language
  – A data flow language: LOAD, STORE, FILTER, ORDER, GROUP, JOIN
Facebook Hive supports a direct SQL interface
Emerging trend: fusion of MapReduce and DB technologies

29 Happy Thanksgiving!

