Matrix Multiplication in Hadoop

Presentation transcript:

Matrix Multiplication in Hadoop Siddharth Saraph

Matrix Multiplication Matrices are used throughout the sciences, engineering, statistics, and beyond. Multiplication is a fundamental nontrivial matrix operation, and it is simpler than an operation like matrix inversion (although the asymptotic time complexity is the same).

Matrix Multiplication Problem: Some applications need enormous matrices that cannot be handled on one machine, so the idea is to take advantage of map-reduce parallelism. For a sense of scale: a 10,000x10,000 matrix has 100,000,000 entries (roughly 800 MB as dense doubles); a 100,000x100,000 matrix has 10,000,000,000 entries (roughly 80 GB). In practice, such matrices are usually sparse.

First Step: Matrix Representation How should a matrix be represented as input to a map-reduce job? A convenient choice for sparse matrices is coordinate list (COO) format: each nonzero entry is stored as a (row index, column index, value) triple (board). Entries with value 0 are omitted, and entries may appear in any order.
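As an illustration only (not the exact classes from this project), a single COO record might be parsed as below; the tab-separated layout and the MatrixEntry holder class are assumptions.

// Minimal sketch: parsing one coordinate-list record of the form "row<TAB>col<TAB>value".
// The tab-separated layout and this MatrixEntry holder are illustrative assumptions,
// not the project's actual classes.
public class MatrixEntry {
    public final int row;
    public final int col;
    public final double value;

    public MatrixEntry(int row, int col, double value) {
        this.row = row;
        this.col = col;
        this.value = value;
    }

    // Parse a single text line of a COO file, e.g. "42\t7\t-3.25".
    public static MatrixEntry parse(String line) {
        String[] parts = line.split("\t");
        return new MatrixEntry(
                Integer.parseInt(parts[0]),
                Integer.parseInt(parts[1]),
                Double.parseDouble(parts[2]));
    }
}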

Second Step: Map Reduce Algorithm There is a simple “entrywise” method, and various related block methods in which the matrices are partitioned into smaller blocks and processed block by block. The block methods bring an excess of notation and indices to keep track of, so it is easy to get lost.

Second Step: Map Reduce Algorithm Worked through on the chalkboard; a sketch of the standard entrywise approach is given below.
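For reference, here is a minimal Hadoop sketch of the standard one-pass entrywise method, not the project's actual code. The tab-separated record layout ("A<TAB>i<TAB>k<TAB>value" or "B<TAB>k<TAB>j<TAB>value"), the Text-encoded keys, and the dimension settings read from the job configuration are assumptions, and the job driver is omitted.

// Sketch of the standard one-pass entrywise method (not the project's exact code).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class EntrywiseMultiply {

    public static class EntryMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            Configuration conf = ctx.getConfiguration();
            int aRows = conf.getInt("A.rows", 0);   // rows of A == rows of C (assumed config keys)
            int bCols = conf.getInt("B.cols", 0);   // cols of B == cols of C
            String[] f = line.toString().split("\t");
            if (f[0].equals("A")) {
                int i = Integer.parseInt(f[1]), k = Integer.parseInt(f[2]);
                // a_ik contributes to every entry in row i of C.
                for (int j = 0; j < bCols; j++) {
                    ctx.write(new Text(i + "," + j), new Text("A," + k + "," + f[3]));
                }
            } else {
                int k = Integer.parseInt(f[1]), j = Integer.parseInt(f[2]);
                // b_kj contributes to every entry in column j of C.
                for (int i = 0; i < aRows; i++) {
                    ctx.write(new Text(i + "," + j), new Text("B," + k + "," + f[3]));
                }
            }
        }
    }

    public static class EntryReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            // Join the A and B contributions on the shared index k, then sum the products.
            java.util.Map<Integer, Double> aVals = new java.util.HashMap<>();
            java.util.Map<Integer, Double> bVals = new java.util.HashMap<>();
            for (Text v : values) {
                String[] f = v.toString().split(",");
                int k = Integer.parseInt(f[1]);
                double x = Double.parseDouble(f[2]);
                (f[0].equals("A") ? aVals : bVals).put(k, x);
            }
            double sum = 0.0;
            for (java.util.Map.Entry<Integer, Double> e : aVals.entrySet()) {
                Double b = bVals.get(e.getKey());
                if (b != null) sum += e.getValue() * b;
            }
            if (sum != 0.0) ctx.write(key, new Text(Double.toString(sum)));
        }
    }
}

Each reducer key names one output entry C(i, j); it receives one A value and one B value for every shared index k, and the reducer joins them on k and sums the products.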

Implementation Java, not Hadoop Streaming. Why? The project seemed complex enough to need the extra control. Custom Key and Value classes. A custom Partitioner class for the block method, to control how keys are distributed to reducers (a sketch follows). Also a chance to learn Java.
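A minimal sketch of what such a partitioner could look like, not the project's actual class; it assumes a Text key whose first comma-separated field is a block-row index.

// Minimal sketch of a custom Partitioner (not the project's actual class).
// It assumes a Text key whose first comma-separated field is a block-row index,
// and it routes all keys with the same block row to the same reducer.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class BlockRowPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        int blockRow = Integer.parseInt(key.toString().split(",")[0]);
        return (blockRow & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Such a class would be registered on the job with job.setPartitionerClass(BlockRowPartitioner.class).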

Performance Random matrix generator: parameters are row dimension, column dimension, and density, with entries drawn as doubles in (-1000, 1000). Many parameters to vary: matrix dimensions, maximum entry magnitude, number of splits, number of reducers, and matrix density. Sparse 1000x1000, density 0.1, 6 splits, 12 reducers, 2.9 MB of input: about 5 minutes. Sparse 5000x5000, density 0.1, 20 splits, 20 reducers, 73 MB of input: more than 1 hour.
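A generator along these lines could be as simple as the following sketch; the COO text output, class name, and command-line arguments are assumptions, not the project's actual tool.

// Minimal sketch of a random sparse matrix generator in COO text format
// (not the project's actual generator). Entries are doubles in (-1000, 1000),
// and each position is kept with probability equal to the given density.
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

public class RandomCooMatrix {
    public static void generate(String path, int rows, int cols, double density)
            throws IOException {
        Random rng = new Random();
        try (PrintWriter out = new PrintWriter(new FileWriter(path))) {
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols; j++) {
                    if (rng.nextDouble() < density) {
                        double value = (rng.nextDouble() * 2.0 - 1.0) * 1000.0;
                        out.println(i + "\t" + j + "\t" + value);
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // e.g. a 1000x1000 matrix with density 0.1
        generate(args[0], Integer.parseInt(args[1]),
                 Integer.parseInt(args[2]), Double.parseDouble(args[3]));
    }
}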

MATLAB Performance Windows 7, MATLAB 2015a 64-bit. Engineering Library cluster, 4 GB RAM: 13,000x13,000 was about the largest full random matrix of doubles that would fit in memory; multiplication time about 2 minutes. LaFortune cluster, 16 GB RAM: 20,000x20,000 sparse matrix with density 0.1; multiplication time about 2 minutes 30 seconds.

Improvements? A different matrix representation? There may be better ways to represent sparse matrices than coordinate list format. Strassen's algorithm? It runs in roughly O(n^2.81) time, with benefits of only about 10% at matrix dimensions of a few thousand. Use a different algorithm? Use a different platform, such as Spark?

Conclusion What happened to the enormous matrices? Based on this project, I do not think Hadoop is a practical choice for implementing matrix multiplication, and I did not find any Hadoop implementations of matrix multiplication that provide a significant benefit over a local machine.