High-Level MapReduce APIs CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Agenda Apache Pig Apache Hive Hadoop Streaming Apache Spark

APACHE PIG

What Is Pig? Developed by Yahoo! and a top-level Apache project Immediately makes data on a cluster available to non-Java programmers via Pig Latin – a dataflow language Interprets Pig Latin and generates MapReduce jobs that run on the cluster Enables easy data summarization, ad-hoc reporting and querying, and analysis of large volumes of data Pig interpreter runs on a client machine – no administrative overhead required

Pig Terms All data in Pig is one of four types: – An Atom is a simple data value - stored as a string but can be used as either a string or a number – A Tuple is a data record consisting of a sequence of "fields" Each field is a piece of data of any type (atom, tuple, or bag) – A Bag is a set of tuples (also referred to as a 'Relation') The concept of a table – A Map is a map from keys that are string literals to values that can be any data type The concept of a hash map
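A minimal Pig Latin sketch of these types (the file name, field names, and values are hypothetical):
-- Each record has an atom (name), a bag of tuples (orders),
-- and a map from string keys to values (attributes)
people = LOAD 'people.txt'
    AS (name:chararray,                                   -- atom
        orders:bag{t:tuple(item:chararray, qty:int)},     -- bag of tuples
        attributes:map[]);                                -- map, chararray keys
-- Reference a field by name and look up a map value by key
summary = FOREACH people GENERATE name, attributes#'city';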

Pig Capabilities Support for – Grouping – Joins – Filtering – Aggregation Extensibility – Support for User Defined Functions (UDFs) Leverages the same massive parallelism as native MapReduce

Pig Basics Pig is a client application – No cluster software is required Interprets Pig Latin scripts to MapReduce jobs – Parses Pig Latin scripts – Performs optimization – Creates execution plan Submits MapReduce jobs to the cluster

Execution Modes Pig has two execution modes – Local Mode - all files are installed and run using your local host and file system – MapReduce Mode - all files are installed and run on a Hadoop cluster and HDFS installation Interactive – By using the Grunt shell by invoking Pig on the command line $ pig grunt> Batch – Run Pig in batch mode using Pig Scripts and the "pig" command $ pig -f id.pig -p =...

Pig Latin Pig Latin scripts are generally organized as follows – A LOAD statement reads data – A series of “transformation” statements process the data – A STORE statement writes the output to the filesystem A DUMP statement displays output on the screen Logical vs. physical plans: – All statements are stored and validated as a logical plan – Once a STORE or DUMP statement is found the logical plan is executed

Example Pig Script
-- Load the content of a file into a pig bag named 'input_lines'
input_lines = LOAD 'CHANGES.txt' AS (line:chararray);
-- Extract words from each line and put them into a pig bag named 'words'
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
-- Store the results (executes the pig script)
STORE ordered_word_count INTO 'output';

Basic “grunt” Shell Commands Help is available $ pig -h Pig supports HDFS commands grunt> pwd – put, get, cp, ls, mkdir, rm, mv, etc.

About Pig Scripts Pig Latin statements grouped together in a file Can be run from the command line or the shell Support parameter passing Comments are supported – Inline comments '--' – Block comments /* */
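A small sketch of a parameterized script with both comment styles; the script name, parameter names, and paths are hypothetical:
/* wordfilter.pig: keep lines containing a word passed on the command line */
-- $INPUT, $WORD, and $OUTPUT are filled in with -param at run time
lines = LOAD '$INPUT' AS (line:chararray);
hits  = FILTER lines BY line MATCHES '.*$WORD.*';
STORE hits INTO '$OUTPUT';
Run it with:
$ pig -param INPUT=CHANGES.txt -param WORD=Fix -param OUTPUT=hits -f wordfilter.pig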

Simple Data Types
Type – Description
int – 4-byte integer
long – 8-byte integer
float – 4-byte (single precision) floating point
double – 8-byte (double precision) floating point
bytearray – Array of bytes; blob
chararray – String ("hello world")
boolean – True/False (case insensitive)
datetime – A date and time
biginteger – Java BigInteger
bigdecimal – Java BigDecimal

Complex Data Types
Type – Description
Tuple – Ordered set of fields (a "row / record")
Bag – Collection of tuples (a "resultset / table")
Map – A set of key-value pairs; keys must be of type chararray

Pig Data Formats BinStorage – Loads and stores data in machine-readable (binary) format PigStorage – Loads and stores data as structured, field delimited text files TextLoader – Loads unstructured data in UTF-8 format PigDump – Stores data in UTF-8 format YourOwnFormat! – via UDFs
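For instance, loading a comma-delimited file and storing tab-delimited output might look like this (file and field names are hypothetical):
-- PigStorage with an explicit field delimiter
employees = LOAD 'employees.csv' USING PigStorage(',') AS (id:int, name:chararray, salary:float);
-- TextLoader reads each line as a single chararray field
raw_text = LOAD 'notes.txt' USING TextLoader() AS (line:chararray);
STORE employees INTO 'employees_tab' USING PigStorage('\t');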

Loading Data Into Pig Loads data from an HDFS file
var = LOAD 'employees.txt';
var = LOAD 'employees.txt' AS (id, name, salary);
var = LOAD 'employees.txt' USING PigStorage() AS (id, name, salary);
Each LOAD statement defines a new bag – Each bag can have multiple elements (atoms) – Each element can be referenced by name or position ($n) A bag is immutable A bag can be aliased and referenced later

Input And Output STORE – Writes output to an HDFS file in a specified directory grunt> STORE processed INTO 'processed_txt'; Fails if directory exists Writes output files, part-[m|r]-xxxxx, to the directory – PigStorage can be used to specify a field delimiter DUMP – Write output to screen grunt> DUMP processed;

Relational Operators FOREACH – Applies expressions to every record in a bag FILTER – Filters by expression GROUP – Collect records with the same key ORDER BY – Sorting DISTINCT – Removes duplicates

FOREACH...GENERATE Use the FOREACH…GENERATE operator to work with rows of data, call functions, etc. Basic syntax: alias2 = FOREACH alias1 GENERATE expression; Example:
DUMP alias1;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
alias2 = FOREACH alias1 GENERATE col1, col2;
DUMP alias2;
(1,2)
(4,2)
(8,3)
(4,3)
(7,2)
(8,4)

FILTER...BY Use the FILTER operator to restrict tuples or rows of data Basic syntax: alias2 = FILTER alias1 BY expression; Example:
DUMP alias1;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
alias2 = FILTER alias1 BY (col1 == 8) OR (NOT (col2+col3 > col1));
DUMP alias2;
(4,2,1)
(8,3,4)
(7,2,5)
(8,4,3)

GROUP...ALL Use the GROUP…ALL operator to group data – Use GROUP when only one relation is involved – Use COGROUP when multiple relations are involved Basic syntax: alias2 = GROUP alias1 ALL; Example:
DUMP alias1;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
alias2 = GROUP alias1 BY col2;
DUMP alias2;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})

ORDER...BY Use the ORDER…BY operator to sort a relation based on one or more fields Basic syntax: alias = ORDER alias BY field_alias [ASC|DESC]; Example:
DUMP alias1;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
alias2 = ORDER alias1 BY col3 DESC;
DUMP alias2;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)

DISTINCT... Use the DISTINCT operator to remove duplicate tuples in a relation. Basic syntax: alias2 = DISTINCT alias1; Example:
DUMP alias1;
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
alias2 = DISTINCT alias1;
DUMP alias2;
(8,3,4)
(1,2,3)
(4,3,3)

Relational Operators FLATTEN – Used to un-nest tuples as well as bags INNER JOIN – Used to perform an inner join of two or more relations based on common field values OUTER JOIN – Used to perform left, right or full outer joins SPLIT – Used to partition the contents of a relation into two or more relations SAMPLE – Used to select a random data sample with the stated sample size
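A brief sketch of SPLIT, SAMPLE, and FLATTEN on relations from the earlier examples (the split condition and sample fraction are hypothetical):
-- Partition alias1 into two relations by the value of its first column
SPLIT alias1 INTO small IF col1 < 5, large IF col1 >= 5;
-- Take roughly a 10% random sample of alias1
sampled = SAMPLE alias1 0.1;
-- Un-nest the bag produced by the earlier GROUP example
flat = FOREACH word_groups GENERATE group, FLATTEN(filtered_words);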

INNER JOIN... Use the JOIN operator to perform an inner, equi-join of two or more relations based on common field values The JOIN operator always performs an inner join Inner joins ignore null keys – Filter null keys before the join JOIN and COGROUP operators perform similar functions – JOIN creates a flat set of output records – COGROUP creates a nested set of output records

INNER JOIN Example
DUMP Alias1;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
DUMP Alias2;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
Join Alias1 by Col1 to Alias2 by Col1:
Alias3 = JOIN Alias1 BY Col1, Alias2 BY Col1;
DUMP Alias3;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)

OUTER JOIN... Use the OUTER JOIN operator to perform left, right, or full outer joins – Pig Latin syntax closely adheres to the SQL standard The keyword OUTER is optional – keywords LEFT, RIGHT and FULL will imply left outer, right outer and full outer joins respectively Outer joins will only work provided the relations which need to produce nulls (in the case of non-matching keys) have schemas Outer joins will only work for two-way joins – To perform a multi-way outer join perform multiple two- way outer join statements

OUTER JOIN Examples
Left Outer Join:
A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A BY $0 LEFT OUTER, B BY $0;
Full Outer Join:
A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A BY $0 FULL OUTER, B BY $0;

User-Defined Functions Natively written in Java, packaged as a jar file – Other languages include Jython, JavaScript, Ruby, Groovy, and Python Register the jar with the REGISTER statement Optionally, alias it with the DEFINE statement
REGISTER /src/myfunc.jar;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);

DEFINE DEFINE can be used to work with UDFs and also streaming commands – Useful when dealing with complex input/output formats
/* read and write comma-delimited data */
DEFINE Y 'stream.pl' INPUT(stdin USING PigStreaming(',')) OUTPUT(stdout USING PigStreaming(','));
A = STREAM X THROUGH Y;
/* Define UDFs to a more readable format */
DEFINE MAXNUM org.apache.pig.piggybank.evaluation.math.MAX;
A = LOAD 'student_data' AS (name:chararray, gpa1:float, gpa2:double);
B = FOREACH A GENERATE name, MAXNUM(gpa1, gpa2);
DUMP B;

APACHE HIVE

What Is Hive? Developed by Facebook and a top-level Apache project A data warehousing infrastructure based on Hadoop Immediately makes data on a cluster available to non-Java programmers via SQL-like queries Built on HiveQL (HQL), a SQL-like query language Interprets HiveQL and generates MapReduce jobs that run on the cluster Enables easy data summarization, ad-hoc reporting and querying, and analysis of large volumes of data

What Hive Is Not Hive, like Hadoop, is designed for batch processing of large datasets Not an OLTP or real-time system Latency and throughput are both high compared to a traditional RDBMS – Even when dealing with relatively small data ( <100 MB )

Data Hierarchy Hive is organised hierarchically into: – Databases: namespaces that separate tables and other objects – Tables: homogeneous units of data with the same schema Analogous to tables in an RDBMS – Partitions: determine how the data is stored Allow efficient access to subsets of the data – Buckets/clusters For subsampling within a partition Join optimization
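Roughly, in HiveQL (the database, table, and column names are hypothetical):
CREATE DATABASE IF NOT EXISTS weblogs;            -- namespace
USE weblogs;
CREATE TABLE hits (url STRING, userid BIGINT)     -- table in the weblogs database
PARTITIONED BY (dt STRING)                        -- partitions: one per day
CLUSTERED BY (userid) INTO 32 BUCKETS             -- buckets within each partition
STORED AS TEXTFILE;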

HiveQL HiveQL / HQL provides the basic SQL-like operations: – Select columns using SELECT – Filter rows using WHERE – JOIN between tables – Evaluate aggregates using GROUP BY – Store query results into another table – Download results to a local directory (i.e., export from HDFS) – Manage tables and queries with CREATE, DROP, and ALTER

Primitive Data Types
Type – Comments
TINYINT, SMALLINT, INT, BIGINT – 1, 2, 4 and 8-byte integers
BOOLEAN – TRUE/FALSE
FLOAT, DOUBLE – Single and double precision real numbers
STRING – Character string
TIMESTAMP – Unix-epoch offset or datetime string
DECIMAL – Arbitrary-precision decimal
BINARY – Opaque; ignore these bytes

Complex Data Types
Type – Comments
STRUCT – A collection of elements; if S is of type STRUCT {a INT, b INT}: S.a returns element a
MAP – Key-value tuple; if M is a map from 'group' to GID: M['group'] returns value of GID
ARRAY – Indexed list; if A is an array of elements ['a','b','c']: A[0] returns 'a'

HiveQL Limitations HQL only supports equi-joins, outer joins, left semi-joins Because it is only a shell for MapReduce, complex queries can be hard to optimise Missing large parts of full SQL specification: – HAVING clause in SELECT – Correlated sub-queries – Sub-queries outside FROM clauses – Updatable or materialized views – Stored procedures

Hive Metastore Stores Hive metadata Default metastore database uses Apache Derby Various configurations: – Embedded (in-process metastore, in-process database) Mainly for unit tests – Local (in-process metastore, out-of-process database) Each Hive client connects to the metastore directly – Remote (out-of-process metastore, out-of-process database) Each Hive client connects to a metastore server, which connects to the metadata database itself
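As an illustration, a local-mode metastore backed by an out-of-process MySQL database might be configured in hive-site.xml roughly like this (the host, database name, and driver choice are assumptions):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/metastore</value>   <!-- out-of-process database -->
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value></value>   <!-- empty: client runs the metastore in-process (local mode) -->
</property>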

Hive Warehouse Hive tables are stored in the Hive “warehouse” – Default HDFS location: /user/hive/warehouse Tables are stored as sub-directories in the warehouse directory Partitions are subdirectories of tables External tables are supported in Hive The actual data is stored in flat files

Hive Schemas Hive is schema-on-read – Schema is only enforced when the data is read (at query time) – Allows greater flexibility: same data can be read using multiple schemas Contrast with an RDBMS, which is schema-on-write – Schema is enforced when the data is loaded – Speeds up queries at the expense of load times

Create Table Syntax
CREATE TABLE table_name (
  col1 data_type,
  col2 data_type,
  col3 data_type,
  col4 data_type
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS format_type;

Simple Table
CREATE TABLE page_view (
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

More Complex Table
CREATE TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

External Table
CREATE EXTERNAL TABLE page_view_stg (
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';

More About Tables CREATE TABLE – LOAD: file moved into Hive’s data warehouse directory – DROP: both metadata and data deleted CREATE EXTERNAL TABLE – LOAD: no files moved – DROP: only metadata deleted – Use this when sharing with other Hadoop applications, or when you want to use multiple schemas on the same data

Partitioning Can make some queries faster Divide data based on partition column Use PARTITIONED BY clause when creating table Use PARTITION clause when loading data SHOW PARTITIONS will show a table's partitions (see the sketch below)
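A short sketch; the table name, path, and date value are hypothetical:
CREATE TABLE daily_logs (userid BIGINT, page_url STRING)
PARTITIONED BY (dt STRING)   -- dt becomes a directory level, not a stored column
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/staging/logs_day1.txt'
INTO TABLE daily_logs PARTITION (dt='2015-04-01');

SHOW PARTITIONS daily_logs;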

Bucketing Can speed up queries that involve sampling the data – Sampling works without bucketing, but Hive has to scan the entire dataset Use CLUSTERED BY when creating table – For sorted buckets, add SORTED BY To query a sample of your data, use TABLESAMPLE
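A hedged sketch of a bucketed table and a sampled query (the table name and bucket count are hypothetical):
-- 32 buckets, hashed on userid and sorted within each bucket
CREATE TABLE user_events (userid BIGINT, event STRING)
CLUSTERED BY (userid) SORTED BY (userid) INTO 32 BUCKETS;

-- Read roughly 1/32 of the data by scanning a single bucket
SELECT * FROM user_events TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);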

Browsing Tables And Partitions
Command – Comments
SHOW TABLES; – Show all the tables in the database
SHOW TABLES 'page.*'; – Show tables matching the specification (uses regex syntax)
SHOW PARTITIONS page_view; – Show the partitions of the page_view table
DESCRIBE page_view; – List columns of the table
DESCRIBE EXTENDED page_view; – More information on columns (useful only for debugging)
DESCRIBE page_view PARTITION (ds=' '); – List information about a partition

Loading Data Use LOAD DATA to load data from a file or directory – Will read from HDFS unless LOCAL keyword is specified – Will append data unless OVERWRITE specified – PARTITION required if destination table is partitioned
LOAD DATA LOCAL INPATH '/tmp/pv_ _us.txt'
OVERWRITE INTO TABLE page_view
PARTITION (date=' ', country='US')

Inserting Data Use INSERT to load data from a Hive query – Will append data unless OVERWRITE specified – PARTITION required if destination table is partitioned
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION (dt=' ', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url
WHERE pvs.country = 'US';

Inserting Data Normally only one partition can be inserted into with a single INSERT A multi-insert lets you insert into multiple partitions
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION (dt=' ', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION (dt=' ', country='CA')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION (dt=' ', country='UK')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url WHERE pvs.country = 'UK';

Inserting Data During Table Creation Use AS SELECT in the CREATE TABLE statement to populate a table as it is created
CREATE TABLE page_view AS
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url
FROM page_view_stg pvs
WHERE pvs.country = 'US';

Loading And Inserting Data: Summary
Use this – For this purpose
LOAD – Load data from a file or directory
INSERT – Load data from a query; one partition at a time; use multiple INSERTs to insert into multiple partitions in the one query
CREATE TABLE AS (CTAS) – Insert data while creating a table
Add/modify external file – Load new data into external table

Sample Select Clauses Select from a single table:
SELECT * FROM sales WHERE amount > 10 AND region = "US";
Select from a partitioned table:
SELECT page_views.* FROM page_views
WHERE page_views.date >= ' ' AND page_views.date <= ' '

Relational Operators ALL and DISTINCT – Specify whether duplicate rows should be returned – ALL is the default (all matching rows are returned) – DISTINCT removes duplicate rows from the result set WHERE – Filters by expression – Does not support IN, EXISTS or sub-queries in the WHERE clause LIMIT – Indicates the number of rows to be returned
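For example, reusing the sales table from the previous slide:
-- Distinct regions, at most 10 rows returned
SELECT DISTINCT region FROM sales WHERE amount > 10 LIMIT 10;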

Relational Operators GROUP BY – Group data by column values – Select statement can only include columns included in the GROUP BY clause ORDER BY / SORT BY – ORDER BY performs total ordering Slow, poor performance – SORT BY performs partial ordering Sorts output from each reducer
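A sketch reusing the sales table (column names assumed):
-- Every selected non-aggregate column must appear in GROUP BY
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region;
-- SORT BY orders rows within each reducer only; ORDER BY gives a total order
SELECT * FROM sales SORT BY amount DESC;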

Advanced Hive Operations JOIN – If only one column in each table is used in the join, then only one MapReduce job will run
This results in 1 MapReduce job:
SELECT * FROM a JOIN b ON a.key = b.key JOIN c ON b.key = c.key
This results in 2 MapReduce jobs:
SELECT * FROM a JOIN b ON a.key = b.key JOIN c ON b.key2 = c.key
– If multiple tables are joined, put the biggest table last and the reducer will stream the last table, buffer the others – Use left semi-joins to take the place of IN/EXISTS
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON a.key = b.key;

Advanced Hive Operations JOIN – Do not specify join conditions in the WHERE clause Hive does not know how to optimise such queries Will compute a full Cartesian product before filtering it Join Example:
SELECT a.ymd, a.price_close, b.price_close
FROM stocks a JOIN stocks b ON a.ymd = b.ymd
WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND a.ymd > ' ';

Hive Stinger MPP-style execution of Hive queries Available since Hive 0.13 No MapReduce We will talk about this more when we get to SQL on Hadoop

HADOOP STREAMING no cool logo

Overview Means to link any program that reads/writes stdin/stdout using MapReduce Configure your job to use built-in Java code, bash executables, or custom scripts Mappers and reducers transform input key/value pairs into lines of text – key \t value Programs output tab-delimited key/value pairs as text – key \t value

Commands
-input, -output, -mapper, -combiner, -reducer, -inputformat, -outputformat, -partitioner, -numReduceTasks, -inputreader, -file
-cmdenv (env.var commands)
-mapdebug (run script when map task fails)
-reducedebug (run script when reduce task fails)
-lazyOutput, -background, -verbose, -info, -help

Example! – Line Count
~]$ hadoop jar hadoop-streaming.jar \
    -input CHANGES.txt \
    -output streaming-out \
    -mapper /bin/cat \
    -reducer "/usr/bin/wc -l"
~]$ hdfs dfs -cat streaming-out/*
13861

Example! – Python Word Count
~]$ cat wcMapper.py
#!/usr/bin/python
import sys

for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print '%s\t%s' % (word, 1)
exit(0)

~]$ cat wcReducer.py
#!/usr/bin/python
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

if current_word == word:
    print '%s\t%s' % (current_word, current_count)
exit(0)

Testing Scripts
~]$ chmod +x ./wcMapper.py ./wcReducer.py
~]$ cat CHANGES.txt | ./wcMapper.py | sort | ./wcReducer.py
via       3106
to        2001
the       1733
in        1170
a         1037
and       927
for       819
of        816
Fix       792
cutting)  751
** Data output commands shown without "| sort -nrk2 | head"

~]$ hadoop jar hadoop-streaming.jar \
    -files wcMapper.py,wcReducer.py \
    -input CHANGES.txt \
    -output wc-streaming-out \
    -mapper wcMapper.py \
    -reducer wcReducer.py
~]$ hdfs dfs -cat wc-streaming-out/*
via       3106
to        2001
the       1733
in        1170
a         1037
and       927
for       819
of        816
Fix       792
cutting)  751
** Data output commands shown without "| sort -nrk2 | head"

APACHE SPARK

Concern! Tell me some algorithms that require many iterations What's wrong with Hadoop MapReduce? – In this context only, please (:

New Frameworks! Researchers have developed new frameworks to keep intermediate data in-memory – Only support specific computation patterns (Map...Reduce... repeat) – No abstractions for general re-use of data

Enter: RDDs Or Resilient Distributed Datasets Fault-tolerant parallel data structures that enable: – Persisting data in memory – Specifying partitioning schemes for optimal placement – Manipulating them with a rich set of operators

Apache Spark Lightning-Fast Cluster Computation Open-source top-level Apache project that came out of Berkeley in 2010 General-purpose cluster computation system High-level APIs in Scala, Java, and Python Higher-level tools: – Shark for HiveQL on Spark – MLlib for machine learning – GraphX for graph processing – Spark Streaming

Glossary
Term – Meaning
Application – User program built on Spark, consisting of a driver program and executors on the cluster
Driver Program – The process running the main function of the application and creating the SparkContext
Cluster Manager – An external service for acquiring resources on the cluster
Worker Node – Any node that can run application code on the cluster
Executor – A process launched for an application on a worker node that runs tasks and keeps data in memory or disk storage. Each application has its own executors.
Task – A unit of work that will be sent to one executor
Job – A parallel computation
Stage – Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs

RDD Persistence and Partitioning Persistence – Users can control which RDDs will be reused and choose a storage strategy Partitioning – What we know and love! – Hash-partitioning based on some key for efficient joins
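A rough Scala sketch, assuming spark is the SparkContext used in the other examples; the file paths and partition count are hypothetical:
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Reuse this RDD across jobs by keeping it in memory
val pages = spark.textFile("hdfs://...").persist(StorageLevel.MEMORY_ONLY)

// Hash-partition a key/value RDD so later joins on the key avoid a full shuffle
val byUser = spark.textFile("hdfs://...")
  .map(line => (line.split("\t")(0), line))
  .partitionBy(new HashPartitioner(16))
  .persist()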

RDD Fault-Tolerance Replicating data in-flight is costly and hard Instead of replicating every data set, let's just log the transformations of each data set to keep its lineage Loss of an RDD partition can be rebuilt by replaying the transformations – Only the lost partitions need to be rebuilt!

RDD Storage Transformations are lazy operations No computations occur until an action RDDs can be persisted in-memory, but are spilled to disk if necessary Users can specify a number of flags to persist the data – Only on disk – Partitioning schemes – Persistence priorities

RDD Eviction Policy LRU policy at an RDD level New RDD partition is computed, but not enough space? Evict partition from the least recently accessed RDD – Unless it is the same RDD as the one with the new partition

Example! Log Mining Say you want to search through terabytes of log files stored in HDFS for errors and play around with them
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

Example! Log Mining
// Count the number of error logs
errors.count()
// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()
// Return the time fields of errors mentioning HDFS as an array
errors.filter(_.contains("HDFS")).map(_.split('\t')(3)).collect()

Spark Execution Flow Nothing happens to errors until an action occurs The original HDFS file is not stored in-memory, only the final RDD This will greatly speed up all of the future actions on the RDD

Architecture

PageRank

Spark PageRank
// Load graph as an RDD of (URL, outlinks) pairs
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    (url, (links, rank)) => links.map(dest => (dest, rank/links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x,y) => x+y)
    .mapValues(sum => a/N + (1-a)*sum)
}

Spark PageRank Lineage

Tracking Lineage Narrow vs. Wide Dependencies

Scheduler DAGs

Spark API Every data set is an object, and transformations are invoked on these objects Start with a data set, then transform it using operators like map, filter, and join Then, do some actions like count, collect, or save

Spark API

Iterative Example Search for a hyperplane w that best separates two sets of points
val points = spark.textFile(...).map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1/(1+exp(-p.y*(w dot p.x)))-1)*p.y
  }.reduce((a,b) => a+b)
  w -= gradient
}

Iterative Performance

Performance with Failures

Python Sorted Word Count
# Execute with pyspark
from pyspark import SparkContext

sc = SparkContext("local", "WordCount")
file = sc.textFile("hdfs://491vm:9000/user/shadam1/CHANGES.txt")
counts = file.flatMap(lambda line: line.lower().split(" ")) \
    .filter(lambda (word): word != "") \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda c1, c2: c1 + c2) \
    .map(lambda (word, count): (count, word)) \
    .sortByKey(False)
counts.saveAsTextFile("hdfs://491vm:9000/user/shadam1/spark-wc")

User Applications In-Memory Analytics – Sped up Hive queries that ran a series of aggregations over the same data set Traffic Modeling – Modeled traffic congestion from sporadic GPS data points Twitter Spam Classification – Identified link spam in Twitter messages via linear regression models

Spark Review RDDs are bad ass Spark is also pretty cool – Because it implements RDDs in a bunch of languages I wish I could play with it more because it seems too good to be true

References p-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html