Tree and Graph Processing On Hadoop, Ted Malaska

Presentation transcript:

1 Tree and Graph Processing On Hadoop Ted Malaska

2 Schedule: Intro; Overview of Hadoop and Eco-System; Summarize Tree Rooting; MR Overview/Implementation Options; HBase Overview/Implementation Options; Giraph Overview/Implementation Options; Spark Overview/Implementation Options; Summary; Questions

3 Intro Hi there

4 Overview of Hadoop and Eco-System. Diagram labels: Search, NoSql, Machine Learning, LFP, RTQ, Streaming, Ingestion, Batch, HDFS, Security and Access Controls, Auditing and Monitoring, Map Reduce, Pig, Crunch, Hive, Giraph, Sqoop, Flume, Kafka, Storm, Spark Streaming, Spark, Impala, Mahout, Oryx, RR, Python, Streaming, SAS, HBase, Accumulo, NFS, Search, SolR

5 In Scope for Tonight. Diagram: the ecosystem diagram from slide 4, repeated.

6 Summarize Tree Rooting: Basic Tree. Diagram labels: True Root, Leaves, Branches, Vertex, Edge, Depth

7 Summarize Tree Rooting: More Complex Tree. Diagram labels: Circular Link, Multiple Parents

8 Summarize Tree Rooting: Merging Trees. Borderline true graph problem. Diagram labels: Multi Rooted Vertex, True Root

9 Summarize Tree Rooting Know your data

10 Basic Storage Format | Example | | |

11 Preprocessing: Trimming Data. Nodes and edges have data. Data has weight. Normally linkage information is under 10% of true data size. Organize data by partitioning.

12 Basic Solution
Step 1: Identify Roots. Echo to all edges. Vertexes that receive no echoes are roots. Root the root.
Step 2: Walk the tree. Echo from the last newly rooted vertex to all edges. If a vertex is not already rooted, then root it.
Example table: each record carries an R: field that stays R:Null until the vertex is rooted and then holds the root ID, such as R:101.
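The two steps of the basic solution can be sketched on a single machine as follows. This is a minimal illustration, not the presenter's code: the `root_tree` function, the adjacency-dict layout, and the sample vertex IDs are assumptions made for the example, and edges are assumed to point from parent to child.

```python
from collections import deque


def root_tree(adjacency):
    """Assign a root to every reachable vertex.

    adjacency maps a vertex ID to the list of vertex IDs it has edges to
    (hypothetical layout chosen for this sketch).
    """
    # Step 1: every vertex echoes along its edges; a vertex that
    # receives no echo is a root.
    echoed = {child for children in adjacency.values() for child in children}
    all_vertices = set(adjacency) | echoed
    roots = sorted(v for v in all_vertices if v not in echoed)

    # "Root the root": a root's root is itself.
    root_of = {r: r for r in roots}

    # Step 2: walk the tree, echoing from the most recently rooted vertices.
    frontier = deque(roots)
    while frontier:
        vertex = frontier.popleft()
        for child in adjacency.get(vertex, []):
            # Only root a vertex once; this also stops circular links
            # from looping forever.
            if child not in root_of:
                root_of[child] = root_of[vertex]
                frontier.append(child)
    return root_of


example = {101: [201, 202], 201: [], 202: [301], 301: []}
rooted = root_tree(example)
```

With a vertex that has multiple parents, this sketch keeps whichever root reaches it first; a real job would need an explicit rule for that case.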

13 Map Reduce: Massively parallel processing on Hadoop. Based on the Google 2004 MapReduce white paper. Able to process PBs of data.

14 Map Reduce. Diagram labels: Data Blocks, Mapper, Sort & Shuffle, Mapper, Data Blocks

15 Map Reduce: Self joins. Always dumping two outputs: Newly Rooted and Still Un-Rooted. Diagram labels: All Data; MR - Stage0 Root Identifying (outputs Un-Rooted, Newly Rooted); MR - Stage1 Rooting (outputs Un-Rooted, Newly Rooted, Old Rooted 0); MR - Stage2 Rooting (outputs Un-Rooted, Newly Rooted, Old Rooted 0, Old Rooted 1)
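The staged self-join can be imitated in memory to show how each stage produces its two outputs. This is a rough sketch under stated assumptions: `map_phase`, `shuffle`, `reduce_phase`, and `run_stages` are hypothetical helper names, the record layout is invented for the example, and the Stage0 root identification is assumed to have already run.

```python
from collections import defaultdict


def map_phase(newly_rooted, unrooted):
    # Newly rooted vertices send their root to each child.
    for vertex, (root, children) in newly_rooted.items():
        for child in children:
            yield child, ("ROOT", root)
    # Un-rooted vertices pass their own record through so the reducer
    # can join it with any incoming root message (the self join).
    for vertex, children in unrooted.items():
        yield vertex, ("RECORD", children)


def shuffle(pairs):
    # Group values by key, standing in for Hadoop's sort and shuffle.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    newly_rooted, still_unrooted = {}, {}
    for vertex, values in groups.items():
        records = [payload for tag, payload in values if tag == "RECORD"]
        roots = [payload for tag, payload in values if tag == "ROOT"]
        if not records:
            continue  # vertex was rooted in an earlier stage
        children = records[0]
        if roots:
            newly_rooted[vertex] = (min(roots), children)
        else:
            still_unrooted[vertex] = children
    # Two outputs per stage: newly rooted and still un-rooted.
    return newly_rooted, still_unrooted


def run_stages(stage0_rooted, unrooted):
    old_rooted, newly = {}, stage0_rooted
    while newly:
        old_rooted.update(newly)
        newly, unrooted = reduce_phase(shuffle(map_phase(newly, unrooted)))
    return old_rooted, unrooted


stage0 = {101: (101, [201, 202])}
pending = {201: [], 202: [301], 301: []}
rooted, leftover = run_stages(stage0, pending)
```

Each pass of the `while` loop corresponds to one full MapReduce job, which is why deep trees are expensive with this approach.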

16 Map Reduce: Great for large batch operations. No memory limit. Not good at iterations.

17 HBase: Largest and most used NoSql implementation in the world. Based on the Google 2006 BigTable white paper. Imagine it like a giant HashMap with keys and values. Handles around 100k operations a second, even on a small 10-node cluster.

18 HBase: Getting. Diagram labels: Client, HBase Master, HBase Region Server, Block Cache

19 HBase: Putting. Diagram labels: Client, HBase Master, HBase Region Server, WAL, MemStore, HFile

20 HBase: Good for graph traversing. Bad for large batch processing. Scan rate about 8x slower than HDFS. Good for the end of a long tail.
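To illustrate why a key-value store suits traversal, the sketch below walks from one vertex up to its root using one keyed lookup per hop. A plain Python dict stands in for the HBase table; the `find_root` function, the `"parent"` field, and the `max_hops` guard are all assumptions for this example and are not the HBase client API.

```python
def find_root(table, vertex_id, max_hops=1000):
    """Follow parent pointers with one random-access read per hop.

    table is a dict standing in for a key-value table keyed by vertex ID
    (hypothetical layout). Returns None for a circular link, a missing
    row, or a chain longer than max_hops.
    """
    seen = set()
    current = vertex_id
    for _ in range(max_hops):
        if current in seen:
            return None  # circular link
        seen.add(current)
        row = table.get(current)  # a single keyed get, no scan
        if row is None:
            return None  # dangling reference
        parent = row.get("parent")
        if parent is None:
            return current  # no parent: this is the root
        current = parent
    return None


table = {
    101: {"parent": None},
    201: {"parent": 101},
    202: {"parent": 101},
    301: {"parent": 202},
}
root_of_301 = find_root(table, 301)
```

The cost is proportional to the depth of one path rather than the size of the data set, which fits the "end of a long tail" case; rooting every vertex this way would mean one such walk per vertex.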

21 Giraph: System built for large batch graph processing. Based on the Pregel 2009 white paper. Hardened by LinkedIn and Facebook. Reported to handle up to a trillion edges.

22 Giraph: Loading. Diagram labels: Data Blocks, Worker, Master

23 Giraph (Bulk Synchronous Parallel). Diagram labels: Worker, Local vertex computing, Communication, Barrier synchronization, Local vertex computing
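The bulk synchronous parallel cycle (local vertex computing, communication, barrier synchronization) can be sketched as a single-process loop. This is an illustration of the model only, not Giraph's API: `run_bsp`, the inbox/outbox dicts, and the assumption that roots are already known are all made up for the example.

```python
def run_bsp(adjacency, roots):
    """Root a graph with Pregel-style supersteps.

    adjacency maps every vertex ID to the list of vertices it has edges
    to; roots is the list of already identified root IDs. Both layouts
    are hypothetical.
    """
    root_of = {vertex: None for vertex in adjacency}
    # Messages waiting to be delivered at the start of superstep 0.
    inbox = {r: [r] for r in roots}
    superstep = 0
    while inbox:
        outbox = {}
        # Local vertex computing: each vertex with messages runs once.
        for vertex, messages in inbox.items():
            if root_of[vertex] is not None:
                continue  # already rooted, so it stays halted
            root_of[vertex] = min(messages)
            # Communication: send the root along every outgoing edge.
            for neighbor in adjacency[vertex]:
                outbox.setdefault(neighbor, []).append(root_of[vertex])
        # Barrier synchronization: messages only become visible to the
        # next superstep after every vertex has finished this one.
        inbox = outbox
        superstep += 1
    return root_of, superstep


graph = {101: [201, 202], 201: [], 202: [301], 301: []}
root_of, supersteps = run_bsp(graph, [101])
```

Unlike the staged MapReduce jobs, the graph stays loaded in the workers between supersteps, so only the messages move on each iteration.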

24 Giraph: Most mature bulk graph processing out there. Of all the solutions, the most graph focused.

25 Spark: At Berkeley around 2011, some asked if we could do better than MR. Take advantage of lower-cost memory. Building on everything before.

26 Spark. Diagram labels: Spark Worker, DAG Scheduler (like a queue planner), RDD Objects, Task Scheduler, Task Threads, Block Manager, Cluster Manager. Example: Rdd1.join(rdd2).groupBy(…).filter(…)

27 Spark Implementations: Onion MR approach with basic Spark. Pregel approach with Bagel or GraphX. Bagel is a façade over generic Spark functionality. GraphX is an effort to extend Spark. Less code. Learning curve. It's raw and will be changing a lot in the next year.
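One way to read the "onion" approach is as a loop that joins the current frontier with the edge list and peels off one layer of the tree per iteration. The sketch below uses plain Python lists of (key, value) pairs in place of RDDs; `join` and `root_by_layers` are hypothetical stand-ins written for this example, not Spark, Bagel, or GraphX calls.

```python
def join(left, right):
    """Inner join of two lists of (key, value) pairs.

    Mimics the shape of a pair-dataset join: the result holds
    (key, (left_value, right_value)) for every matching pair.
    """
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]


def root_by_layers(edges, roots):
    """edges is a list of (parent, child) pairs; roots is a list of IDs."""
    rooted = {r: r for r in roots}
    frontier = [(r, r) for r in roots]  # (vertex, root) pairs
    while frontier:
        # Join the frontier with the edges: (parent, (root, child)).
        joined = join(frontier, edges)
        candidates = [(child, root) for _, (root, child) in joined]
        # Keep only vertices that are not rooted yet (the filter step).
        frontier = []
        for child, root in candidates:
            if child not in rooted:
                rooted[child] = root
                frontier.append((child, root))
    return rooted


edges = [(101, 201), (101, 202), (202, 301)]
layers = root_by_layers(edges, [101])
```

In a real Spark job the edge data set could be kept in memory across iterations, which is the main difference from rerunning a MapReduce job per layer.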