Apache Tez : Accelerating Hadoop Data Processing

Presentation transcript:

Apache Tez : Accelerating Hadoop Data Processing Bikas Saha @bikassaha

Tez – Introduction Distributed execution framework targeted towards data-processing applications. Based on expressing a computation as a dataflow graph. Highly customizable to meet a broad spectrum of use cases. Built on top of YARN – the resource management framework for Hadoop. Open source Apache project and Apache licensed.

Tez – Hadoop 1 ---> Hadoop 2
Hadoop 1 (monolithic): Resource Management – MR; Execution Engine – MR.
Hadoop 2 (layered): Resource Management – YARN; Engines – Hive, Pig, Cascading, Your App!

Tez – Empowering Applications
Tez solves the hard problems of running in a distributed Hadoop environment, so apps can focus on solving their domain-specific problems. This design is important for Tez to be a platform for a variety of applications.
App brings: custom application logic, custom data format, custom data transfer technology.
Tez provides: distributed parallel execution, negotiating resources from the Hadoop framework, fault tolerance and recovery, horizontal scalability, resource elasticity, a shared library of ready-to-use components, built-in performance optimizations, security.

Tez – End User Benefits
Better performance of applications: built-in performance plus application-defined optimizations.
Better predictability of results: minimization of overheads and queuing delays.
Better utilization of compute capacity: efficient use of allocated resources.
Reduced load on the distributed filesystem (HDFS): fewer unnecessary replicated writes.
Reduced network usage: better locality and data transfer using new data patterns.
Higher application developer productivity: focus on application business logic rather than Hadoop internals.

Tez – Design considerations
Look to the future with an eye on the past: don't solve problems that have already been solved, or else you will have to solve them again!
Leverage the discrete task-based compute model for elasticity, scalability and fault tolerance.
Leverage several man-years of work on Hadoop Map-Reduce data shuffling operations.
Leverage the proven resource sharing and multi-tenancy model of Hadoop and YARN.
Leverage the built-in security mechanisms in Hadoop for privacy and isolation.

Tez – Problems that it addresses
Expressing the computation: direct and elegant representation of the data processing flow; interfacing with application code and new technologies.
Performance: late binding, i.e. make decisions as late as possible using real data available at runtime; leverage the resources of the cluster efficiently; just work out of the box; customizable to let applications tailor the job to their specific requirements.
Operational simplicity: painless to operate, experiment and upgrade.

Tez – Simplifying Operations
No deployments to do. No side effects. Easy and safe to try it out! Tez is a completely client-side application. Simply upload the Tez libraries to any accessible FileSystem and change the local Tez configuration to point to that location. This enables running different versions concurrently: easy to test new functionality while keeping stable versions for production. Leverages YARN local resources.
(Diagram: client machines with TezClient pointing to Tez Lib 1 / Tez Lib 2 on HDFS; TezTasks running on Node Managers.)
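As a rough illustration of the client-side setup described above (the HDFS path and version number are hypothetical examples), deployment amounts to copying the Tez tarball to a location on the cluster filesystem and pointing tez.lib.uris at it in the client's tez-site.xml:

    <!-- tez-site.xml on the client machine -->
    <property>
      <name>tez.lib.uris</name>
      <!-- location where the Tez tarball was uploaded, e.g. /apps/tez-0.5.0 -->
      <value>${fs.defaultFS}/apps/tez-0.5.0/tez-0.5.0.tar.gz</value>
    </property>

Because the libraries are just files on the filesystem, pointing different clients at different paths is what allows multiple Tez versions to run side by side.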

Tez – Expressing the computation
Distributed data processing jobs typically look like DAGs (Directed Acyclic Graphs). Vertices in the graph represent data transformations. Edges represent data movement from producers to consumers.
(Example diagram: a distributed sort, with a Preprocessor stage feeding a Sampler, which produces ranges for a Partition stage, followed by an Aggregate stage; each stage runs multiple tasks.)

Tez – Expressing the computation
Tez provides the following APIs to define the processing:
DAG API – Defines the structure of the data processing and the relationships between producers and consumers. Enables definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical DAG at runtime. This is how the connection of tasks in the job gets specified.
Runtime API – Defines the interfaces with which the framework and app code interact with each other. App code transforms data and moves it between tasks. This is how we specify what actually executes in each task on the cluster nodes.

Tez – DAG API
Defines the global processing flow:
// Define DAG
DAG dag = DAG.create();
// Define Vertex
Vertex Map1 = Vertex.create(Processor.class);
// Define Edge
Edge edge = Edge.create(Map1, Reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, Output.class, Input.class);
// Connect them
dag.addVertex(Map1).addEdge(edge)…
(Diagram: Map1 and Map2 feed Reduce1 and Reduce2 over scatter-gather edges, which feed a final Join.)
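For concreteness, here is a minimal sketch of the same flow against the 0.5-line Java API. MyProcessor, MyOutput and MyInput are placeholder class names (not Tez classes), the task counts are arbitrary, and the fragment is assumed to live inside a driver method:

    import org.apache.tez.dag.api.DAG;
    import org.apache.tez.dag.api.Edge;
    import org.apache.tez.dag.api.EdgeProperty;
    import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
    import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
    import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
    import org.apache.tez.dag.api.InputDescriptor;
    import org.apache.tez.dag.api.OutputDescriptor;
    import org.apache.tez.dag.api.ProcessorDescriptor;
    import org.apache.tez.dag.api.Vertex;

    // Two logical vertices connected by a scatter-gather (shuffle-style) edge.
    DAG dag = DAG.create("example-dag");

    Vertex map1 = Vertex.create("Map1",
        ProcessorDescriptor.create(MyProcessor.class.getName()), 10);
    Vertex reduce1 = Vertex.create("Reduce1",
        ProcessorDescriptor.create(MyProcessor.class.getName()), 2);

    Edge edge = Edge.create(map1, reduce1,
        EdgeProperty.create(
            DataMovementType.SCATTER_GATHER,  // producers shard, consumers gather
            DataSourceType.PERSISTED,         // producer output persists until consumed
            SchedulingType.SEQUENTIAL,        // consumers scheduled after producers
            OutputDescriptor.create(MyOutput.class.getName()),
            InputDescriptor.create(MyInput.class.getName())));

    dag.addVertex(map1).addVertex(reduce1).addEdge(edge);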

Tez – DAG API Edge properties define the connection between producer and consumer tasks in the DAG Data movement – Defines routing of data between tasks One-To-One : Data from the ith producer task routes to the ith consumer task. Broadcast : Data from a producer task routes to all consumer tasks. Scatter-Gather : Producer tasks scatter data into shards and consumer tasks gather the data. The ith shard from all producer tasks routes to the ith consumer task.

Tez – Logical DAG expansion at Runtime
(Diagram: the logical vertices Map1, Map2, Reduce1, Reduce2 and Join expand into their constituent parallel tasks at runtime.)

Tez – Runtime API
Flexible Inputs-Processor-Outputs model. Thin API layer to wrap around arbitrary application code. Compose inputs, a processor and outputs to execute arbitrary processing. Applications decide the logical data format and data transfer technology. Customize for performance. Built-in implementations for Hadoop 2.0 data services (HDFS and the YARN ShuffleService) are built on the same API. Your implementations are as first class as ours!
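As a rough sketch of what the Inputs-Processor-Outputs model looks like in code (assuming the 0.5-era runtime API; WordCounterProcessor is a made-up name), a custom processor extends AbstractLogicalIOProcessor and is handed its logical inputs and outputs by the framework:

    import java.util.List;
    import java.util.Map;
    import org.apache.tez.runtime.api.AbstractLogicalIOProcessor;
    import org.apache.tez.runtime.api.Event;
    import org.apache.tez.runtime.api.LogicalInput;
    import org.apache.tez.runtime.api.LogicalOutput;
    import org.apache.tez.runtime.api.ProcessorContext;

    public class WordCounterProcessor extends AbstractLogicalIOProcessor {

      public WordCounterProcessor(ProcessorContext context) {
        super(context);
      }

      @Override
      public void initialize() throws Exception { /* set up per-task state */ }

      @Override
      public void run(Map<String, LogicalInput> inputs,
                      Map<String, LogicalOutput> outputs) throws Exception {
        // Read from the named inputs, apply application logic, write to the named
        // outputs. The Input/Output implementations decide data format and transfer.
      }

      @Override
      public void handleEvents(List<Event> processorEvents) { /* react to control-plane events */ }

      @Override
      public void close() throws Exception { /* release resources */ }
    }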

Tez – Task Composition
A logical DAG of vertices V-A, V-B and V-C expands into tasks: Task A runs Processor-A with Output-1 and Output-3; Task B runs Processor-B reading Input-2; Task C runs Processor-C reading Input-4.
V-A = { Processor-A.class }
V-B = { Processor-B.class }
V-C = { Processor-C.class }
Edge AB = { V-A, V-B, Output-1.class, Input-2.class }
Edge AC = { V-A, V-C, Output-3.class, Input-4.class }

Tez – Library of Inputs and Outputs
Classical 'Map' = Map Processor + HDFS Input + Sorted Output.
Classical 'Reduce' = Reduce Processor + Shuffle Input + HDFS Output.
Intermediate 'Reduce' for Map-Reduce-Reduce = Reduce Processor + Shuffle Input + Sorted Output.
What is built in? Hadoop InputFormat/OutputFormat, Sorted Grouped Partitioned Key-Value Input/Output, Unsorted Grouped Partitioned Key-Value Input/Output, Key-Value Input/Output.
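A hedged sketch of how these built-in pieces compose into a word-count style DAG, based on the 0.5-era tez-mapreduce and tez-runtime-library classes. TokenProcessor and SumProcessor are placeholder processor names, the paths are examples, and tezConf is assumed to be an existing TezConfiguration inside a driver method:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
    import org.apache.tez.dag.api.*;
    import org.apache.tez.mapreduce.input.MRInput;
    import org.apache.tez.mapreduce.output.MROutput;
    import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;

    // Vertex 1: read text from HDFS through the Hadoop InputFormat bridge.
    DataSourceDescriptor source = MRInput.createConfigBuilder(
        new Configuration(tezConf), TextInputFormat.class, "/data/in").build();
    Vertex tokenizer = Vertex.create("Tokenizer",
        ProcessorDescriptor.create(TokenProcessor.class.getName()))
        .addDataSource("MRInput", source);

    // Vertex 2: write results back to HDFS through the Hadoop OutputFormat bridge.
    DataSinkDescriptor sink = MROutput.createConfigBuilder(
        new Configuration(tezConf), TextOutputFormat.class, "/data/out").build();
    Vertex summer = Vertex.create("Summer",
        ProcessorDescriptor.create(SumProcessor.class.getName()), 2)
        .addDataSink("MROutput", sink);

    // Connect them with the sorted, grouped, partitioned key-value edge (shuffle).
    OrderedPartitionedKVEdgeConfig edgeConf = OrderedPartitionedKVEdgeConfig
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName()).build();

    DAG dag = DAG.create("wordcount")
        .addVertex(tokenizer).addVertex(summer)
        .addEdge(Edge.create(tokenizer, summer, edgeConf.createDefaultEdgeProperty()));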

Tez – Composable Task Model
Adopt → Evolve → Optimize. Inputs range from HDFS Input and Remote File Server Input to RDMA Input and Native DB Input; processors range from the Hive Processor to Custom Processors; outputs range from HDFS Output and Local Disk Output to Kafka Pub-Sub Output and Amazon S3 Output.

Tez – Performance Benefits of expressing the data processing as a DAG Reducing overheads and queuing effects Gives system the global picture for better planning Efficient use of resources Re-use resources to maximize utilization Pre-launch, pre-warm and cache Locality & resource aware scheduling Support for application defined DAG modifications at runtime for optimized execution Change task concurrency Change task scheduling Change DAG edges Change DAG vertices

Tez – Benefits of DAG execution
Faster execution and higher predictability: eliminate the replicated-write barrier between successive computations; eliminate the job launch overhead of workflow jobs; eliminate the extra stage of map reads in every workflow job; eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes; better locality because the engine has the global picture.
(Figure: the same Pig/Hive workflow as a chain of MR jobs vs. a single Tez DAG.)

Tez – Container Re-Use
Reuse YARN containers/JVMs to launch new tasks. Reduces scheduling and launching delays. Shares in-memory data across tasks. JVM JIT friendly execution.
(Diagram: the Tez Application Master sends Start Task for TezTask1 and, after Task Done, Start Task for TezTask2 to the same YARN container/JVM on the host, with Shared Objects surviving across tasks.)
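Container reuse is controlled through configuration; for example (tez-site.xml property names shown with illustrative values – defaults may differ by version):

    <property>
      <name>tez.am.container.reuse.enabled</name>
      <value>true</value>
    </property>
    <property>
      <!-- keep a warm pool of containers held by a session -->
      <name>tez.am.session.min.held-containers</name>
      <value>10</value>
    </property>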

Tez – Sessions
Standard concepts of pre-launch and pre-warm applied: key for interactive queries. A session represents a connection between the user and the cluster. Multiple DAGs are executed in the same session, and containers are re-used across queries. The session takes care of data locality and of releasing resources when idle.
(Diagram: the client starts a session and submits DAGs to the Application Master, whose Task Scheduler draws on a container pool of pre-warmed JVMs with a shared object registry.)
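A minimal sketch of session usage with the 0.5-era client API (dag1 and dag2 are DAGs built as shown earlier; error handling omitted):

    import org.apache.tez.client.TezClient;
    import org.apache.tez.dag.api.TezConfiguration;
    import org.apache.tez.dag.api.client.DAGClient;

    TezConfiguration tezConf = new TezConfiguration();
    // Third argument enables session mode: one AM serves many DAGs.
    TezClient tezClient = TezClient.create("interactive-session", tezConf, true);
    tezClient.start();
    tezClient.waitTillReady();                    // AM is up and ready for DAGs

    DAGClient run1 = tezClient.submitDAG(dag1);   // first query warms the container pool
    run1.waitForCompletion();
    DAGClient run2 = tezClient.submitDAG(dag2);   // later queries reuse AM and containers
    run2.waitForCompletion();

    tezClient.stop();                             // release the session's resources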

Tez – Re-Use in Action
(Figure: task execution timeline showing containers reused over time.)

Tez – Customizable Core Engine
Vertex Manager: determines task parallelism and determines when tasks in a vertex can start.
DAG Scheduler: determines the priority of tasks.
Task Scheduler: allocates containers from YARN and assigns them to tasks.
(Diagram: the Vertex Manager starts vertices and tasks, the DAG Scheduler provides priorities, and the Task Scheduler obtains containers for Vertex-1 and Vertex-2.)

Tez – Event Based Control Plane
Events are used to communicate between tasks, and between tasks and the framework. A Data Movement Event is used by a producer task to inform the consumer task about data location, size etc. An Input Error event is sent by a task to the engine to report errors in reading input; the engine then takes action by re-generating the input. Other events carry task completion notifications, data statistics and other control plane information.
(Diagram: outputs of Map Task 1 and Map Task 2 send Data Events, routed by the AM over a scatter-gather edge to the inputs of Reduce Task 2, which can send an Error Event back.)

Tez – Automatic Reduce Parallelism
Event model: map tasks send data statistics events to the Reduce Vertex Manager.
Vertex Manager: pluggable application logic that understands the data statistics and can formulate the correct parallelism; it advises the vertex controller on parallelism.
(Diagram: the Map Vertex sends Data Size Statistics to the Vertex Manager in the App Master, which tells the Reduce Vertex state machine to Set Parallelism, re-route data and cancel the now-unneeded task.)
Speaker notes: For anyone who has been working on MapReduce, there is this age-old problem of "how do I figure out the correct number of reducers?". We guess some number at compile time and usually that turns out to be incorrect at run time. Let's see how we can use the Tez model to fix that. Here is a Map Vertex and a Reduce Vertex with tasks running, and the Vertex Manager running inside the framework. The Map tasks send data size statistics to the Vertex Manager, which can extrapolate those statistics to estimate the final size of the data once all of the maps finish. Based on that, it can realize that the data size is actually smaller than expected and that two reduce tasks suffice instead of three. The Vertex Manager then sends a Set Parallelism command to the framework, which changes the routing information between the two vertices and also cancels the last task.

Tez – Reduce Slow Start / Pre-launch
Event model: map completion events are sent to the Reduce Vertex Manager.
Vertex Manager: pluggable application logic that understands the data size; it advises the vertex controller to launch the reducers before all maps have completed, so that shuffle can start.
(Diagram: Task Completed events flow from the Map Vertex to the Vertex Manager in the App Master, which sends Start Tasks to the Reduce Vertex state machine.)
Speaker notes: So the new plan looks like this, which is how Tez handles the concurrency problem of plan re-configuration. What about when you actually want to launch those tasks? In MapReduce there is an interesting concept of pre-launching the reducers: even though the reducers are going to read the output of 100 mappers, we may want to launch them when only 50 mappers have finished, because they can then use spare capacity in the cluster to start pre-fetching data from the completed mappers while the rest of their inputs are still being produced. How can we do this in Tez? Again, the same Vertex Manager comes into the picture. As the Map tasks complete, they send Task Completed notifications to the Vertex Manager. It can decide that enough Map tasks have finished and that one of the reducers should start so it can begin pre-fetching data. It sends a Start Task notification to the framework, which results in one of the Reduce tasks starting.

Tez – Theory to Practice Performance Scalability

Tez – Hive TPC-DS Scale 200GB latency
(Chart: query latencies at 200 GB scale; example query 1: SELECT pageURL, pageRank FROM rankings WHERE pageRank > X.)

Tez – Pig performance gains Demonstrates performance gains from a basic translation to a Tez DAG Deeper integration in the works for further improvements 1.5x to 3x speedup on some of the Pigmix queries.

Tez – Observations on Performance
Number of stages in the DAG: the higher the number of stages in the DAG, the better Tez performs relative to MR.
Cluster/queue capacity: the more congested a queue is, the better Tez performs relative to MR, due to container reuse.
Size of intermediate output: the larger the intermediate output, the better Tez performs relative to MR, due to reduced HDFS usage.
Size of data in the job: for smaller data and more stages, Tez performs better relative to MR because launch overhead is a higher percentage of total time for smaller jobs.
Offload work to the cluster: move as much work as possible to the cluster by modelling it via the job DAG; exploit the parallelism and resources of the cluster (e.g. MR split calculation).
Vertex caching: the more re-computation can be avoided, the better the performance.

Tez – Data at scale Hive TPC-DS Scale 10TB

Tez – DAG definition at scale Hive : TPC-DS Query 64 Logical DAG

Tez – Container Reuse at Scale 78 vertices + 8374 tasks on 50 containers (TPC-DS Query 4)

Tez – Real World Use Cases for the API

Tez – Broadcast Edge
Hive : Broadcast Join
SELECT ss.ss_item_sk, ss.ss_quantity, avg_price, inv.inv_quantity_on_hand
FROM (SELECT avg(ss_sold_price) AS avg_price, ss_item_sk, ss_quantity_sk
      FROM store_sales GROUP BY ss_item_sk) ss
JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk);
Hive – MR: Store Sales scan with group-by and aggregation (reduces the size of this input); the aggregated output is written to HDFS; the Inventory and Store Sales (aggregated) output are then scanned and shuffle-joined.
Hive – Tez: Store Sales scan with group-by and aggregation; the result flows over a broadcast edge directly to the Inventory scan and join tasks, avoiding the intermediate HDFS write.

Tez – Multiple Outputs
Pig : Split & Group-by
f = LOAD 'foo' AS (x, y, z);
g1 = GROUP f BY y;
g2 = GROUP f BY z;
j = JOIN g1 BY group, g2 BY group;
Pig – MR: Load foo; Group by y and Group by z run as separate jobs with intermediate results on HDFS; Load g1 and Load g2; Join.
Pig – Tez: a single DAG where the Load vertex feeds both Group by y and Group by z via multiple outputs (split multiplex / de-multiplex), with reduce-follows-reduce into the Join.

Tez – Multiple Outputs
Hive : Multi-insert queries
FROM (SELECT * FROM store_sales, date_dim
      WHERE ss_sold_date_sk = d_date_sk AND d_year = 2000)
INSERT INTO TABLE t1 SELECT DISTINCT ss_item_sk
INSERT INTO TABLE t2 SELECT DISTINCT ss_customer_sk;
Hive – MR: map join of date_dim and store_sales; the join is materialized on HDFS; two MR jobs then compute the distinct item and customer keys.
Hive – Tez: broadcast join (scan date_dim, join store_sales) feeding the distinct computation for customers and items in a single DAG, with the two inserts as multiple outputs.

Tez – One to One Edge
Pig : Skewed Join
l = LOAD 'left' AS (x, y);
r = LOAD 'right' AS (x, z);
j = JOIN l BY x, r BY x USING 'skewed';
Pig – MR: Load & Sample; aggregate the sample; stage the sample map on the distributed cache via HDFS; Partition L and Partition R; Join.
Pig – Tez: the input is passed through via a one-to-one edge; the sample map is broadcast to the Partition L and Partition R tasks; Join.

Tez – Custom Edge
Hive : Dynamically Partitioned Hash Join
SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
FROM store_sales ss
JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk);
Hive – MR: the Inventory scan runs as a single local map task; the Store Sales scan and join read the inventory hash table as a side file from HDFS.
Hive – Tez: the Inventory scan runs on the cluster (potentially more than one mapper); a custom vertex reads both inputs – no side file reads; a custom edge routes the outputs of the previous stage to the correct mappers of the next stage.

Tez – Bringing it all together
TPC-DS Query 27 with Hive on Tez: the Tez session populates the container pool; dimension table calculation and HDFS split generation run in parallel; dimension tables are broadcast to the Hive MapJoin tasks; the final reducer is pre-launched and fetches completed inputs.

Tez – Bridging the Data Spectrum
A typical pattern in a TPC-DS query joins a fact table against a series of dimension tables, producing intermediate result tables: broadcast joins are used for small data sets and a shuffle join where the data is larger. Based on data size, the query optimizer can run either plan as a single Tez job.

Tez – Current status
Apache project with a growing community of contributors and users. Latest release is 0.5.0. Focus on stability: testing and quality are the highest priority. Code is ready and deployed on multi-node environments at scale. Support for a wide variety of DAG topologies. Already functionally equivalent to Map Reduce: existing Map Reduce jobs can be executed on Tez with few or no changes. Rapid adoption by important open source projects and by vendors.

Tez – Developer Release
API stability – intuitive and clean APIs that will be supported and remain backwards compatible with future releases.
Local mode execution – allows the application to run entirely in the same JVM without any need for a cluster; enables easy debugging using IDEs on the developer machine.
Performance debugging – adds a swim-lanes analysis tool to visualize job execution and help identify performance bottlenecks.
Documentation and tutorials to learn the APIs.
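A hedged sketch of switching to local mode for IDE debugging: tez.local.mode is the relevant switch, the local filesystem settings shown are illustrative, and imports are as in the TezClient session example above:

    TezConfiguration tezConf = new TezConfiguration();
    tezConf.setBoolean("tez.local.mode", true);         // run the DAG inside the client JVM
    tezConf.set("fs.defaultFS", "file:///");            // use the local filesystem instead of HDFS
    tezConf.setBoolean("tez.runtime.optimize.local.fetch", true); // fetch shuffle data from local disk
    TezClient tezClient = TezClient.create("local-debug", tezConf);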

Tez – Adoption
Apache Hive: Hadoop standard for declarative access via a SQL-like interface (Hive 13); Hive has written its own processor.
Apache Pig: Hadoop standard for procedural scripting and pipeline processing (Pig 14).
Cascading: developer-friendly Java API and SDK; Scalding (Scala API on Cascading) (3.0).
Commercial vendors: ETL – use Tez instead of MR or custom pipelines; analytics vendors – use Tez as a target platform for scaling parallel analytical tools to large data-sets.

Tez – Adoption Path
Pre-requisite: Hadoop 2 with YARN. Tez has zero deployment pain: no side effects or traces left behind on your cluster; low risk and low effort to try out.
Using Hive, Pig, Cascading, Scalding: try them with Tez as the execution engine.
Already have a MapReduce based pipeline: use configuration to run MapReduce on Tez by setting 'mapreduce.framework.name' to 'yarn-tez' in mapred-site.xml (see the snippet below); then consolidate the MR workflow into an MR-DAG on Tez; then change the MR-DAG to use more efficient Tez constructs.
Have a custom pipeline: wrap custom code/scripts into Tez inputs-processor-outputs; translate your custom pipeline topology into a Tez DAG; change the custom topology to use more efficient Tez constructs.
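The MapReduce-on-Tez switch mentioned above is just a configuration change, as it would appear in mapred-site.xml:

    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn-tez</value>
    </property>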

Tez – Roadmap
Richer DAG support: addition of vertices at runtime; shared edges for shared outputs; enhance the Input/Output library.
Performance optimizations: improve support for high concurrency; improve locality aware scheduling; add framework level data statistics; HDFS memory storage integration.
Usability: stability and testability; UI integration with YARN; tools for performance analysis and debugging.

Tez – Community
Early adopters and code contributors are welcome: adopters to drive more scenarios, contributors to make them happen.
Tez meetup for developers and users: http://www.meetup.com/Apache-Tez-User-Group
Technical blog series: http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing
Useful links: Work tracking: https://issues.apache.org/jira/browse/TEZ Code: https://github.com/apache/tez Developer list: dev@tez.apache.org User list: user@tez.apache.org Issues list: issues@tez.apache.org

Tez – Takeaways Distributed execution framework that models processing as dataflow graphs Customizable execution architecture designed for extensibility and user defined performance optimizations Works out of the box with the platform figuring out the hard stuff Zero install – Safe and Easy to Evaluate and Experiment Addresses common needs of a broad spectrum of applications and usage patterns Open source Apache project – your use-cases and code are welcome It works and is already being used by Apache Hive and Pig

Tez – Thanks for your time and attention!
Video with a deep dive on Tez: http://goo.gl/BL67o7 https://hortonworks.webex.com/hortonworks/lsr.php?RCID=b4382d626867fe57de70b218c74c03e4 – WordCount code walk-through with the 0.5 APIs, debugging in local mode, execution and profiling in cluster mode (starts at ~27 min, ends at ~37 min).
Questions? @bikassaha