How to Think in the Map-Reduce Paradigm
Ayon Sinha

Overview
- Think distributed; think very large data
- Convert single-flow algorithms to MapReduce
- Q&A

Think in keys and values
Think about the output first, in terms of key-value pairs, e.g.:
- Dimensions : Metrics (date, webpage, locale : #users, #visits, #abandonment)
- Membership : List of members (cluster centroid representing HackerDojo students: [member1, member2, ...])
- Property : Value (userId: name, location, #transactions, purchase categories with frequencies)
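
As one concrete illustration of the Dimensions:Metrics pattern above, here is a minimal Hadoop Streaming-style mapper sketch that emits a composite (date, webpage, locale) key; the tab-separated input field order (date, webpage, locale, userId) is a made-up assumption, not something from the slides:

    #!/usr/bin/env python
    # Streaming-style mapper: emit a composite (date, webpage, locale)
    # key with the userId as the value, so downstream reducers can
    # compute #users and #visits per key. Field order is assumed.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue                     # skip malformed records
        date, webpage, locale, user_id = fields[:4]
        print("%s\t%s" % (",".join((date, webpage, locale)), user_id))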

Thinking in MapReduce, contd.
- How can the mapper collect this information for the reducers?
- What does the value distribution across keys look like? (A quick way to check is sketched below.)
  - Be very careful of power-law distributions and the "curse of the last reducer".
  - Know the approximate maximum number of values per reducer key.
- Input data independence
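
Before committing to a reducer design, it helps to measure the key distribution on a sample of the data. A minimal sketch (plain local Python, assuming a hypothetical key<TAB>value line format):

    #!/usr/bin/env python
    # Rough skew check: count values per key in a local sample file
    # and print the heaviest keys. The key<TAB>value input format is
    # an assumption for illustration.
    import sys
    from collections import Counter

    counts = Counter()
    with open(sys.argv[1]) as sample:
        for line in sample:
            counts[line.split("\t", 1)[0]] += 1

    # A single key holding most of the values predicts a "last
    # reducer" that runs far longer than all the others.
    for key, n in counts.most_common(10):
        print("%s\t%d" % (key, n))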

Example: a join in MapReduce
Input
- user-id : purchase-info data files
- user-id : user-details data files
Output
- user-id : {user details, purchase categories with frequencies}

Example, contd.
- User-details mappers and user-purchase mappers both emit values keyed by userID, with a prefix tagging the source (D_ for details, P_ for purchases).
- Input to the reducer for one userID: {D_John Doe, 123 Main St, Home Town, CA; P_Amazon Kindle 3, $139, 03/25/2011; P_Cowboy boots, $145, 04/01/2011; P_Aviator Sunglasses, $69, 03/31/…}
- The reducer aggregates the tagged values and emits the joined record for that userID.
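
A minimal Streaming-style reducer for this reduce-side join might look like the sketch below. It assumes the shuffle has already sorted lines as userId<TAB>D_<details> or userId<TAB>P_<purchase>; the exact value layout and output format are illustrative assumptions:

    #!/usr/bin/env python
    # Reduce side of the join: collect the D_ (details) and P_
    # (purchase) values for each userID and emit one joined record.
    import sys

    def emit(user_id, details, purchases):
        if user_id is not None:
            print("%s\t{%s; %d purchases: %s}" %
                  (user_id, "; ".join(details),
                   len(purchases), "; ".join(purchases)))

    current, details, purchases = None, [], []
    for line in sys.stdin:
        user_id, value = line.rstrip("\n").split("\t", 1)
        if user_id != current:               # key boundary: flush previous user
            emit(current, details, purchases)
            current, details, purchases = user_id, [], []
        if value.startswith("D_"):
            details.append(value[2:])
        elif value.startswith("P_"):
            purchases.append(value[2:])
    emit(current, details, purchases)        # flush the last user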

Ricky's Blog

    kmeans(data) {
        initial_centroids = pick(k, data)
        upload(data)                        // ship the input points to the cluster
        writeToS3(initial_centroids)
        old_centroids = initial_centroids
        while (true) {
            map_reduce()                    // one assign-points / recompute-centroids pass
            new_centroids = readFromS3()
            if (change(new_centroids, old_centroids) < delta) {
                break                       // centroids have converged
            }
            old_centroids = new_centroids
        }
        result = readFromS3()
        return result
    }

Mapper and Reducer
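
A Streaming-style sketch of what the mapper and reducer do in one k-means pass: the mapper assigns each point to its nearest centroid, and the reducer averages each group of points into a new centroid. The comma-separated file layout and script interface here are assumptions, not the slide's exact code:

    #!/usr/bin/env python
    # One k-means pass, Hadoop Streaming style (a sketch). Points and
    # centroids are comma-separated coordinate lines.
    import sys

    def parse(line):
        return [float(x) for x in line.strip().split(",")]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def mapper(centroids, lines=sys.stdin):
        # Emit: index of the nearest centroid -> the point itself.
        for line in lines:
            if not line.strip():
                continue
            point = parse(line)
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(point, centroids[i]))
            print("%d\t%s" % (nearest, line.strip()))

    def reducer(lines=sys.stdin):
        # Input arrives sorted by centroid index; average each group
        # of points to produce the new centroid.
        current, acc, n = None, None, 0
        for line in lines:
            idx, point = line.rstrip("\n").split("\t", 1)
            coords = parse(point)
            if idx != current and current is not None:
                print(",".join(str(v / n) for v in acc))
                acc, n = None, 0
            current = idx
            acc = coords if acc is None else [a + c for a, c in zip(acc, coords)]
            n += 1
        if current is not None:
            print(",".join(str(v / n) for v in acc))

    if __name__ == "__main__":
        # Usage: kmeans_step.py map centroids.csv < points.csv
        #        kmeans_step.py reduce < sorted_map_output.txt
        if sys.argv[1] == "map":
            with open(sys.argv[2]) as f:
                centroids = [parse(l) for l in f if l.strip()]
            mapper(centroids)
        else:
            reducer()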

Distance measures
- Euclidean distance
- Manhattan distance
- Jaccard similarity
- Cosine similarity
- Or any other metric that suits your use case (the faster, the better)
- Remember, there is no such thing as "absolute similarity" in the real world. Even identical twins may differ in some trait that makes them hugely dissimilar from that perspective. For example, two shirts of the same brand, color, and pattern are dissimilar to a buyer if the sizes differ, but similar to the manufacturer.
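
For reference, a minimal sketch of the four measures above in plain Python (no external libraries; Jaccard and cosine are written as similarities, as listed on the slide):

    import math

    def euclidean(a, b):
        # Straight-line distance between two coordinate vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def manhattan(a, b):
        # Sum of absolute coordinate differences.
        return sum(abs(x - y) for x, y in zip(a, b))

    def jaccard(a, b):
        # Set similarity: |intersection| / |union|.
        a, b = set(a), set(b)
        return len(a & b) / float(len(a | b)) if a | b else 1.0

    def cosine(a, b):
        # Vector similarity: cosine of the angle between a and b.
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms if norms else 0.0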

K-means time complexity
- Non-parallel algorithm: K * n * O(distance function) * num_iterations
- MapReduce version: K * n * O(distance function) * num_iterations * O(M-R) / s
- O(M-R) = O(K log K * s * (1/p)), where:
  - K is the number of clusters
  - s is the number of nodes
  - p is the ping time between nodes (assuming equal ping times between all nodes in the network)

Recommendations
- Do not limit your thinking to one MapReduce phase. Very few real-world problems can be solved in a single phase. Think Map-Map-Reduce, Map-Reduce-Reduce, Map-Reduce-Map-Reduce, and so on (a toy chained-phase sketch follows below).
- Partition and filter your data as early as possible in the flow. (What is the other reason match-making sites ask for preferences before running their massively parallel match algorithms?)
- Apply simple algorithms to large data first, and increase complexity only as needed. Are the added complexity and maintenance costs worth it in a business setting? Banko and Brill showed in Scaling to Very Very Large Corpora for Natural Language Disambiguation (2001) that vast amounts of data can help less complex algorithms perform as well as or better than more complex ones trained on less data.
- Remember "the curse of the last reducer": with real data, one cluster will invariably have far more points to process than most others.
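
To make the multi-phase recommendation concrete, here is a toy local simulation (plain Python, no Hadoop; run_phase is a made-up stand-in for a real MapReduce phase) that chains two phases: word count, then regrouping words by frequency:

    from itertools import groupby
    from operator import itemgetter

    def run_phase(records, map_fn, reduce_fn):
        # Local stand-in for one phase: map, shuffle-sort, reduce.
        mapped = [kv for rec in records for kv in map_fn(rec)]
        mapped.sort(key=itemgetter(0))
        out = []
        for key, group in groupby(mapped, key=itemgetter(0)):
            out.extend(reduce_fn(key, [v for _, v in group]))
        return out

    docs = ["a rose is a rose", "is it a rose"]
    # Phase 1: classic word count.
    counts = run_phase(
        docs,
        lambda doc: [(w, 1) for w in doc.split()],
        lambda w, ones: [(w, sum(ones))],
    )
    # Phase 2: invert (word, count) to group words by frequency.
    by_freq = run_phase(
        counts,
        lambda wc: [(wc[1], wc[0])],
        lambda n, words: [(n, sorted(words))],
    )
    print(by_freq)  # [(1, ['it']), (2, ['is']), (3, ['a', 'rose'])]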

References
- Ricky Ho's blog, Pragmatic Programming Techniques
- Collective Intelligence in Action by Satnam Alag
- Programming Collective Intelligence by Toby Segaran
- Algorithms of the Intelligent Web by Marmanis and Babenko
- Banko, M. and Brill, E. (2001). Scaling to Very Very Large Corpora for Natural Language Disambiguation.