How Google would do GREP. 684.02, Spring 2006. Google: massive datasets and massive numbers of machines, working in parallel.

Requirements
Need a programming model that
- Parallelizes easily
- Allows Ph.D.-level engineers and scientists to specify and execute NLP-like tasks on the big clusters
- Does not require serious expertise in parallel programming

Map/Reduce
Insight 1: Much of the input/output is generic, so specify only the transformation required.
Insight 2: The part of the process that says "do something to every item" is really easy to parallelize.
Insight 3: Do something to every item and then collect the results.
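The first two insights can be illustrated with a minimal Python sketch. It is not Google's implementation: the helper name `run_over_items`, the thread pool, and the `line_length` transformation are assumptions chosen only to show that the generic part (feeding items in, collecting results) can be written once, while the task-specific part is a single function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_over_items(transform: Callable[[T], R],
                   items: Iterable[T],
                   workers: int = 4) -> List[R]:
    """Generic part (hypothetical helper): apply `transform` to every item
    in parallel and collect the results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(transform, items))


# Task-specific part: the only code the application programmer writes.
def line_length(line: str) -> int:
    return len(line)


lengths = run_over_items(line_length, ["alpha", "be", "gamma!"])
```

Because each item is handled independently, the same transformation could in principle be spread over many machines rather than a few threads.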

Map
Produces one output per input line:
A -> []
B -> []
C -> [(C,1)]
D -> []
C -> [(C,1)]
Each output is a possibly empty list of key/value pairs.
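A map function with this behaviour might look like the following Python sketch. The search string `PATTERN` and the name `map_fn` are assumptions for illustration; the slide does not say how a line is matched, so a plain substring test is used here.

```python
from typing import List, Tuple

PATTERN = "C"  # hypothetical search string, standing in for the grep pattern


def map_fn(line: str) -> List[Tuple[str, int]]:
    """Emit [(line, 1)] when the line matches, otherwise an empty list."""
    return [(line, 1)] if PATTERN in line else []


lines = ["A", "B", "C", "D", "C"]
mapped = [map_fn(line) for line in lines]
# mapped == [[], [], [("C", 1)], [], [("C", 1)]]
```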

Reduce
The map/reduce implementation gathers together all pairs with the same key, so reduce sees each key paired with a list of its values: [... (C,[1,1]) ...]
For this task, reduce just takes the length of the list of values.
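The gathering step and the reduce step can be sketched in Python as follows. In a real system the grouping is done by the map/reduce implementation across many machines; the in-memory `group_by_key` helper here is an assumption that only mimics its effect, and `reduce_fn` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Stand-in for the framework: collect all values that share a key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """Count the matches for a key by taking the length of its value list."""
    return (key, len(values))


pairs = [("C", 1), ("C", 1)]  # the non-empty outputs of the map step
grouped = group_by_key(pairs)
counts = [reduce_fn(key, values) for key, values in grouped.items()]
# grouped == {"C": [1, 1]} and counts == [("C", 2)]
```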

Reflections
This is a lot like awk, which said, "Tell me what you do to each line; I'll handle the details of delivering them to you."
Behind the scenes, it is sensible for the implementation to be clever about how it pulls pairs from a large cluster of machines, but this is not the application programmer's problem.

Google's (and Microsoft's) papers
http://labs.google.com/papers/mapreduce.html
http://labs.google.com/papers/sawzall-sciprog.pdf
http:// r.pdf