CMU SCS 15-826: Multimedia Databases and Data Mining Extra: intro to hadoop C. Faloutsos.

Slides:



Advertisements
Similar presentations
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Advertisements

MapReduce Simplified Data Processing on Large Clusters
 Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and.
Overview of MapReduce and Hadoop
Mapreduce and Hadoop Introduce Mapreduce and Hadoop
Based on the text by Jimmy Lin and Chris Dryer; and on the yahoo tutorial on mapreduce at index.html
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.
 Need for a new processing platform (BigData)  Origin of Hadoop  What is Hadoop & what it is not ?  Hadoop architecture  Hadoop components (Common/HDFS/MapReduce)
CS 345A Data Mining MapReduce. Single-node architecture Memory Disk CPU Machine Learning, Statistics “Classical” Data Mining.
MapReduce Simplified Data Processing on Large Clusters Google, Inc. Presented by Prasad Raghavendra.
Distributed Computations MapReduce
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
Tutorial for MapReduce (Hadoop) & Large Scale Processing Le Zhao (LTI, SCS, CMU) Database Seminar & Large Scale Seminar 2010-Feb-15 Some slides adapted.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Hadoop, Hadoop, Hadoop!!! Jerome Mitchell Indiana University.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
B 葉彥廷 B 林廷韋 B 王頃恩. Why we choose this topic Introduction Programming Model Example Implementation Conclusion.
MapReduce.
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe.
Jeffrey D. Ullman Stanford University. 2 Chunking Replication Distribution on Racks.
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
Süleyman Fatih GİRİŞ CONTENT 1. Introduction 2. Programming Model 2.1 Example 2.2 More Examples 3. Implementation 3.1 ExecutionOverview 3.2.
Map Reduce and Hadoop S. Sudarshan, IIT Bombay
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
MapReduce and Hadoop 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 2: MapReduce and Hadoop Mining Massive.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Apache Hadoop MapReduce What is it ? Why use it ? How does it work Some examples Big users.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
CSE 548 Advanced Computer Network Security Document Search in MobiCloud using Hadoop Framework Sayan Cole Jaya Chakladar Group No: 1.
MapReduce M/R slides adapted from those of Jeff Dean’s.
CPS216: Advanced Database Systems (Data-intensive Computing Systems) Introduction to MapReduce and Hadoop Shivnath Babu.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
CS 345A Data Mining MapReduce. Single-node architecture Memory Disk CPU Machine Learning, Statistics “Classical” Data Mining.
Database Applications (15-415) Part II- Hadoop Lecture 26, April 21, 2015 Mohammad Hammoud.
Presented by: Katie Woods and Jordan Howell. * Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
MapReduce Computer Engineering Department Distributed Systems Course Assoc. Prof. Dr. Ahmet Sayar Kocaeli University - Fall 2015.
MapReduce : Simplified Data Processing on Large Clusters P 謝光昱 P 陳志豪 Operating Systems Design and Implementation 2004 Jeffrey Dean, Sanjay.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
C-Store: MapReduce Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 22, 2009.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
MapReduce: simplified data processing on large clusters Jeffrey Dean and Sanjay Ghemawat.
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
MapReduce Basics Chapter 2 Lin and Dyer & /tutorial/
CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-1 Large Graph Mining: Power Tools and a Practitioner’s guide Task 8: hadoop and Tera/Peta byte graphs.
INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
Csinparallel.org Workshop 307: CSinParallel: Using Map-Reduce to Teach Parallel Programming Concepts, Hands-On Dick Brown, St. Olaf College Libby Shoop,
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
MapReduce using Hadoop Jan Krüger … in 30 minutes...
Hadoop Aakash Kag What Why How 1.
15-826: Multimedia Databases and Data Mining
Introduction to MapReduce and Hadoop
Database Applications (15-415) Hadoop Lecture 26, April 19, 2016
CS6604 Digital Libraries IDEAL Webpages Presented by
Hadoop Basics.
Map reduce use case Giuseppe Andronico INFN Sez. CT & Consorzio COMETA
Cse 344 May 4th – Map/Reduce.
Christos Faloutsos CMU
Lecture 16 (Intro to MapReduce and Hadoop)
CS 345A Data Mining MapReduce This presentation has been altered.
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

CMU SCS : Multimedia Databases and Data Mining Extra: intro to hadoop C. Faloutsos

CMU SCS Resources Software – Map/reduce paper [Dean & Ghemawat] – Tutorial: see part1, foils #9-20 –videolectures.net/kdd2010_papadimitriou_sun_yan_lsdm/ (c) 2012 C. Faloutsos2

CMU SCS (c) 2012 C. Faloutsos3 Motivation: Scalability Google: > 450,000 processors in clusters of ~2000 processors each [ Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003 ] Yahoo: 5Pb of data [Fayyad, KDD’07] Problem: machine failures, on a daily basis How to parallelize data mining tasks, then? A: map/reduce – hadoop (open-source clone)

CMU SCS (c) 2012 C. Faloutsos4 2’ intro to hadoop master-slave architecture; n-way replication (default n=3) ‘group by’ of SQL (in parallel, fault-tolerant way) e.g, find histogram of word frequency –compute local histograms –then merge into global histogram select course-id, count(*) from ENROLLMENT group by course-id details

CMU SCS (c) 2012 C. Faloutsos5 2’ intro to hadoop master-slave architecture; n-way replication (default n=3) ‘group by’ of SQL (in parallel, fault-tolerant way) e.g, find histogram of word frequency –compute local histograms –then merge into global histogram select course-id, count(*) from ENROLLMENT group by course-id map reduce details

CMU SCS (c) 2012 C. Faloutsos6 User Program Reducer Master Mapper fork assign map assign reduce read local write remote read, sort Output File 0 Output File 1 write Split 0 Split 1 Split 2 Input Data (on HDFS) By default: 3-way replication; Late/dead machines: ignored, transparently (!) details

CMU SCS More details: (thanks to U Kang for the animations) (c) 2012 C. Faloutsos7

CMU SCS 8 Background: MapReduce MapReduce/Hadoop Framework Map 0Map 1Map 2 Reduce 0Reduce 1 Shuffle HDFS HDFS: fault tolerant, scalable, distributed storage system Mapper: read data from HDFS, output (k,v) pair Reducer: read output from mappers, output a new (k,v) pair to HDFS Output sorted by the key Programmers need to provide only map() and reduce() functions (c) 2012 C. Faloutsos

CMU SCS 9 Background: MapReduce Example: histogram of fruit names Map 0Map 1Map 2 Reduce 0Reduce 1 Shuffle (apple, 1) (strawberry,1) (apple, 2) (orange, 1) (strawberry, 1) (orange, 1) HDFS map( fruit ) { output(fruit, 1); } reduce( fruit, v[1..n] ) { for(i=1; i <=n; i++) sum = sum + v[i]; output(fruit, sum); } (c) 2012 C. Faloutsos

CMU SCS Conclusions Convenient mechanism to process Tera- and Peta-bytes of data, on >hundreds of machines User provides map() and reduce() functions Hadoop handles the rest (eg., machine failures etc) (c) 2012 C. Faloutsos10