Lecture 7: Practical Computing with Large Data Sets cont. CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier Special thanks to Haeberlen and Ives at UPenn

Map-Reduce Problem Work in groups to design a MapReduce implementation of k-means clustering. (A minimal sketch of one possible starting point follows.)
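As a hint at one possible shape for the answer, here is a minimal local sketch of a single k-means iteration in MapReduce style. This is plain Python with hypothetical helper names, not a distributed implementation: the "map" step emits (centroid id, point) pairs, and the "reduce" step averages the points assigned to each centroid.

# A minimal, local sketch of one k-means iteration in MapReduce style.
import math
from collections import defaultdict

def nearest(point, centroids):
    # Map-side helper: index of the centroid closest to this point.
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans_iteration(points, centroids):
    # "Map": emit (centroid_id, point) for every input point.
    groups = defaultdict(list)
    for p in points:
        groups[nearest(p, centroids)].append(p)
    # "Reduce": average the points assigned to each centroid_id.
    # (A centroid that attracts no points is dropped in this sketch.)
    return {cid: tuple(sum(dim) / len(ps) for dim in zip(*ps))
            for cid, ps in groups.items()}

points = [(0.0, 0.0), (0.5, 0.1), (9.0, 9.0), (9.5, 8.8)]
centroids = [(1.0, 1.0), (8.0, 8.0)]
print(kmeans_iteration(points, centroids))  # new centroids after one pass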

Map-Reduce Problem How could we express a random forest as a MapReduce computation?

Map-Reduce Problem How could we express a random forest as a MapReduce computation? What about ID3?

Some additional details
To make this work, we need a few more parts:
– The file system (distributed across all nodes): stores the inputs, outputs, and temporary results.
– The driver program (executes on one node): specifies where to find the inputs and the outputs, specifies which mapper and reducer to use, and can customize the behavior of the execution.
– The runtime system (controls nodes): supervises the execution of tasks, especially via the JobTracker.
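To make the division of labor concrete, here is roughly what the user-visible side looks like in mrjob, a Python wrapper around Hadoop Streaming (a sketch, not the course's required toolchain): the job class plays the driver's role of naming the mapper and reducer, while the input and output locations come from the command line.

# wordcount.py - a minimal mrjob word-count sketch.
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Called once per input line; emits (word, 1) pairs.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Called once per distinct word with all of its counts.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Running it locally is python wordcount.py input.txt; pointing it at a cluster is roughly python wordcount.py -r hadoop hdfs:///input --output-dir hdfs:///output (paths hypothetical).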

Some details
There are fewer computation partitions than data partitions:
– All data is accessible via a distributed filesystem with replication.
– Worker nodes produce data in key order, which makes it easy to merge.
– The master is responsible for scheduling and for keeping all nodes busy.
– The master knows how many data partitions there are and which have completed; finished output is committed to disk atomically.
Locality: the master tries to schedule work on nodes that hold replicas of the data. The master can deal with stragglers (slow machines) by re-executing their tasks somewhere else.

What if a worker crashes?
We rely on the file system being shared across all the nodes. Two types of (crash) faults:
– The node wrote its output and then crashed: here, the file system is likely to have a copy of the complete output.
– The node crashed before finishing its output: the JobTracker sees that the job isn't making progress and restarts the affected tasks elsewhere on the system. (Of course, we then have fewer nodes to do work.)
But what if the master crashes?

Other challenges
– Locality: try to schedule each map task on a machine that already has its data.
– Task granularity: how many map tasks? How many reduce tasks?
– Dealing with stragglers: schedule backup copies of the remaining tasks.
– Saving bandwidth: e.g., with combiners (see the sketch after this list).
– Handling bad records: a crashing worker sends a "last gasp" packet with the sequence number of the offending record, so that record can be skipped on re-execution.
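Extending the earlier mrjob word-count sketch, a combiner pre-aggregates map output on each node before anything crosses the network. This is one illustration of the bandwidth-saving idea, not the only approach:

from mrjob.job import MRJob

class MRWordCountCombined(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def combiner(self, word, counts):
        # Runs on the map node: collapses many (word, 1) pairs into one
        # (word, n) pair before anything is shuffled across the network.
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCountCombined.run()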

Scale and MapReduce
From a particular Google paper on a language built over MapReduce:
"… Sawzall has become one of the most widely used programming languages at Google. … [O]n one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2×10^15 bytes of data (2.8 PB) and wrote 9.9×10^12 bytes (9.3 TB)."
Source: Interpreting the Data: Parallel Analysis with Sawzall (Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan)

Hadoop

HDFS
The Hadoop Distributed File System: a distributed file system that
– provides redundant storage,
– is highly reliable on commodity hardware,
– is designed to expect and tolerate failures,
– is intended for use with large files, and
– is designed for batch inserts.

HDFS - Structure
– Files are stored as collections of blocks.
– Blocks are 64 MB chunks of a file.
– Every block is replicated on at least 3 nodes.
– The NameNode (NN) manages metadata about files and blocks.
– The SecondaryNameNode (SNN) periodically checkpoints the NN's metadata (despite its name, it is not a hot standby).
– DataNodes (DN) store and serve blocks.
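As a back-of-the-envelope check on these defaults, a short calculation (the file size here is illustrative, not from the slides):

# Blocks and raw storage for one file under the defaults above:
# 64 MB blocks, replication factor 3.
import math

BLOCK_MB, REPLICATION = 64, 3
file_mb = 200                           # hypothetical 200 MB file
blocks = math.ceil(file_mb / BLOCK_MB)  # 4 blocks: 64 + 64 + 64 + 8 MB
block_replicas = blocks * REPLICATION   # 12 block replicas cluster-wide
raw_mb = file_mb * REPLICATION          # 600 MB of raw disk consumed
print(blocks, block_replicas, raw_mb)   # -> 4 12 600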

HDFS - Replication
Multiple copies of each block are stored (in addition to the copy on the node that wrote it). Strategy:
– Copy #1 on another node in the same rack.
– Copy #2 on a node in a different rack.

HDFS – Write Handling

HDFS – Read Handling

Handling Node Failure
DNs check in with the NN to report their health. Upon a failure, the NN orders DNs to re-replicate the under-replicated blocks. Fail-over is automated, but highly inefficient. What does this optimize for?

MapReduce – Jobs and Tasks
– Job: a user-submitted map/reduce implementation.
– Task: a single mapper or reducer task. Failed tasks are retried automatically; tasks are run local to their data, if possible.
– The JobTracker (JT) manages job submission and task delegation.
– TaskTrackers (TT) ask for work and execute tasks.

MapReduce Architecture

What happens when a task fails?
Tasks WILL fail! The JT automatically retries failed tasks up to N times; after N failed attempts for a task, the job fails. Why?

What happens when a task fails?
Tasks WILL fail! The JT automatically retries failed tasks up to N times; after N failed attempts for a task, the job fails.
Some tasks are slower than others. Speculative execution is the JT starting up multiple copies of the same task: the first one to complete wins, and the others are killed. When is this useful?

Data Locality
Move computation to the data! Moving data between nodes is assumed to have a high cost, so the scheduler tries to place tasks on nodes that hold their data. When that is not possible, the TT has to fetch the data from a DN over the network.

MapReduce is good for…
– Embarrassingly parallel problems
– Summing, grouping, filtering, joining
– Offline batch jobs on massive data sets
– Analyzing an entire large dataset

MapReduce is ok for…
Iterative jobs (e.g., graph algorithms): each iteration must read and write its data to disk, so the IO/latency cost of every iteration is high.

MapReduce is bad for…
– Jobs with shared state or coordination (tasks should be share-nothing; shared state requires a scalable state store)
– Low-latency jobs
– Jobs on small datasets
– Finding individual records

Hadoop Architecture

Hadoop Stack

Hadoop Stack Components
– HBase: an open-source, non-relational, distributed database. Provides a fault-tolerant way to store large quantities of sparse data.
– Pig: a high-level platform for creating MapReduce programs using the language Pig Latin.
– Hive: data-warehousing infrastructure; provides data summarization, querying, and analysis.
– Cascading: a software abstraction layer for creating and executing complex data-processing workflows.

Apache Spark

What is Spark?
Not a modified version of Hadoop, but a separate, fast, MapReduce-like engine:
– In-memory data storage for very fast iterative queries
– General execution graphs and powerful optimizations
– Up to 40x faster than Hadoop
Compatible with Hadoop's storage APIs: can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

Spark
Programs are divided into two parts:
– the driver program, and
– worker programs, which run on cluster nodes or in local threads.
RDDs are distributed across the workers.
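A minimal PySpark sketch of that split (the master URL and app name are arbitrary choices for illustration): the driver builds a SparkContext, and the context farms work out to workers, here two local threads.

from pyspark import SparkContext

# Driver program: everything at top level here runs on the driver.
sc = SparkContext("local[2]", "driver-worker-sketch")  # 2 local worker threads
rdd = sc.parallelize(range(1000), 4)   # an RDD split into 4 partitions
print(rdd.map(lambda x: x * x).sum())  # the map work runs on the workers
sc.stop()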

Why a New Programming Model?
MapReduce greatly simplified big data analysis, but as soon as it got popular, users wanted more:
– More complex, multi-stage applications (e.g., iterative graph algorithms and machine learning)
– More interactive ad-hoc queries
Both multi-stage and interactive apps require faster data sharing across parallel jobs.

Data Sharing in MapReduce
[Diagram: each iteration reads its input from HDFS and writes its results back to HDFS, so iter. 1, iter. 2, … are separated by HDFS reads and writes; likewise query 1, query 2, query 3, … each re-read the input from HDFS to produce result 1, result 2, result 3, …]
Slow due to replication, serialization, and disk IO.

Data Sharing in Spark
[Diagram: after one-time processing of the input, iterations (iter. 1, iter. 2, …) and queries (query 1, query 2, query 3, …) share data through distributed memory.]
10-100× faster than network and disk.

Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
– Distributed collections of objects that can be cached in memory across cluster nodes
– Manipulated through various parallel operators
– Automatically rebuilt on failure
Interface:
– Clean language-integrated API in Scala
– Can be used interactively from the Scala console

Constructing RDDs
– Parallelize existing collections (e.g., Python lists)
– Transform existing RDDs
– Build from files in HDFS or other storage systems
(Sketches of the first and last appear a few slides below.)

RDDs
The programmer specifies the number of partitions for an RDD. There are two types of operations: transformations and actions.

RDD Transforms
Transforms are lazy: they are not computed immediately. A transformed RDD is executed only when an action runs on it. Why? RDDs can also be persisted (cached) in memory or on disk. (A small sketch of this laziness follows.)
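A PySpark illustration of lazy evaluation (the names here are arbitrary):

from pyspark import SparkContext

sc = SparkContext("local", "lazy-demo")
nums = sc.parallelize(range(1_000_000))
squares = nums.map(lambda x: x * x)           # transformation: returns
evens = squares.filter(lambda x: x % 2 == 0)  # instantly, only records lineage
evens.cache()                                 # mark for in-memory persistence
print(evens.count())                          # action: the pipeline runs now
sc.stop()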

Working with RDDs Create an RDD from a data source Apply transformations to an RDD (map, filter) Apply actions to an RDD (collect, count)

Creating an RDD Create an RDD from a Python collection
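The slide's code did not survive in the transcript; a plausible PySpark version (names illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "from-collection")
data = [1, 2, 3, 4, 5]            # an ordinary Python list
rdd = sc.parallelize(data, 2)     # distribute it across 2 partitions
print(rdd.collect())              # [1, 2, 3, 4, 5]
sc.stop()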

Create an RDD from a File
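Likewise for files, one line per record (the path below is hypothetical):

from pyspark import SparkContext

sc = SparkContext("local", "from-file")
lines = sc.textFile("hdfs://namenode:9000/logs/app.log")  # hypothetical path
print(lines.count())              # number of lines in the file
sc.stop()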

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count    // action
cachedMsgs.filter(_.contains("bar")).count
...

[Diagram: the driver ships tasks to workers holding Block 1, Block 2, and Block 3 of the file; each worker builds Cache 1, Cache 2, or Cache 3 from its partition and returns results.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).

RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions. Example:

messages = textFile(...).filter(_.startsWith("ERROR")).map(_.split('\t')(2))

[Diagram: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]
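In PySpark the lineage chain can be inspected directly; a sketch (the path is hypothetical):

from pyspark import SparkContext

sc = SparkContext("local", "lineage-demo")
messages = (sc.textFile("hdfs://namenode:9000/logs/app.log")
              .filter(lambda line: line.startswith("ERROR"))
              .map(lambda line: line.split("\t")[2]))
print(messages.toDebugString())   # shows the textFile -> filter -> map chain
sc.stop()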

Example: Logistic Regression
Goal: find the best line separating two sets of points.
[Diagram: '+' and '–' points in the plane, with a random initial line iterating toward the target separator.]

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
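A rough PySpark analogue of the Scala above, with assumptions flagged: read_point, the input path, D, and ITERATIONS are all illustrative, and the arithmetic mirrors the slide's gradient formula.

import math
import random
from pyspark import SparkContext

D, ITERATIONS = 2, 10                       # illustrative constants

def read_point(line):
    # Hypothetical input format: "y x1 x2 ..." (whitespace-separated).
    nums = [float(v) for v in line.split()]
    return nums[0], nums[1:]                # (label y, features x)

sc = SparkContext("local", "logreg-sketch")
data = sc.textFile("hdfs://namenode:9000/points.txt").map(read_point).cache()

w = [random.random() for _ in range(D)]
for _ in range(ITERATIONS):
    def point_gradient(point, w=w):         # bind the current w into the closure
        y, x = point
        dot = sum(wi * xi for wi, xi in zip(w, x))
        scale = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
        return [scale * xi for xi in x]
    grad = data.map(point_gradient).reduce(
        lambda a, b: [ai + bi for ai, bi in zip(a, b)])
    w = [wi - gi for wi, gi in zip(w, grad)]

print("Final w:", w)
sc.stop()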

Logistic Regression Performance
Hadoop: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for each further iteration.

Supported Operators
map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...

For next time Project Presentations, discussion on project scoping for Big Data.