
Matrix Multiplication in Hadoop


1 Matrix Multiplication in Hadoop
Siddharth Saraph

2 Matrix Multiplication
Matrices are ubiquitous in the sciences, engineering, statistics, and elsewhere. Multiplication is a fundamental nontrivial matrix operation, and it is simpler than something like matrix inversion (although the two have the same asymptotic time complexity).

3 Matrix Multiplication
Problem: some applications call for enormous matrices that cannot be handled on one machine. Take advantage of map-reduce parallelism to approach this problem. For a sense of scale: a 10,000x10,000 matrix has 100,000,000 entries; a 100,000x100,000 matrix has 10,000,000,000. In practice, such matrices are usually sparse.

4 First Step: Matrix Representation
How should a matrix be represented as input to a map-reduce job? A convenient choice for sparse matrices is "coordinate list format": one (row index, column index, value) triple per entry (example on the board). Entries with value 0 are omitted, and the triples may appear in an arbitrary order.
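The coordinate list idea can be sketched in a few lines of plain Java. The class and method names here are illustrative, not taken from the project itself:

```java
import java.util.ArrayList;
import java.util.List;

public class CooMatrix {
    // One nonzero entry in coordinate list format: (row index, col index, value).
    public record Entry(int row, int col, double value) {}

    // Convert a dense array to coordinate list form, omitting zero entries.
    public static List<Entry> fromDense(double[][] dense) {
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < dense.length; i++)
            for (int j = 0; j < dense[i].length; j++)
                if (dense[i][j] != 0.0)
                    entries.add(new Entry(i, j, dense[i][j]));
        return entries;
    }

    public static void main(String[] args) {
        double[][] m = {{1.0, 0.0}, {0.0, 3.5}};
        // Only the two nonzero entries are stored.
        System.out.println(fromDense(m).size()); // 2
    }
}
```

For a sparse matrix, the storage cost is proportional to the number of nonzeros rather than to rows x columns, which is what makes the enormous-matrix case feasible at all.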

5 Second Step: Map Reduce Algorithm
Simple “entrywise” method. Various related block methods: the matrices are partitioned into smaller blocks and logically processed block by block. These involve an excess of notation and indices to keep track of, and it is easy to get lost.

6 Second Step: Map Reduce Algorithm
Chalkboard.
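The chalkboard derivation does not survive in this transcript, but one common two-job entrywise formulation can be sketched in plain Java, with HashMaps standing in for Hadoop's shuffle. Whether this matches the exact variant presented on the board is an assumption; the names are illustrative:

```java
import java.util.*;

public class EntrywiseMapReduce {
    record Entry(int row, int col, double value) {}

    // In-memory sketch of the two-job entrywise method for C = A * B:
    //   Job 1 map/shuffle: key A entries by their column k and B entries by
    //     their row k, so matching pairs reach the same "reducer".
    //   Job 1 reduce: cross each pair of groups, emitting partial products
    //     A(i,k) * B(k,j) keyed by the output cell (i,j).
    //   Job 2 reduce: sum the partial products for each cell.
    static Map<List<Integer>, Double> multiply(List<Entry> a, List<Entry> b) {
        Map<Integer, List<Entry>> aByCol = new HashMap<>();
        Map<Integer, List<Entry>> bByRow = new HashMap<>();
        for (Entry e : a) aByCol.computeIfAbsent(e.col(), k -> new ArrayList<>()).add(e);
        for (Entry e : b) bByRow.computeIfAbsent(e.row(), k -> new ArrayList<>()).add(e);
        Map<List<Integer>, Double> c = new HashMap<>();
        for (Integer k : aByCol.keySet())
            for (Entry ae : aByCol.get(k))
                for (Entry be : bByRow.getOrDefault(k, List.of()))
                    c.merge(List.of(ae.row(), be.col()),
                            ae.value() * be.value(), Double::sum);
        return c;
    }

    public static void main(String[] args) {
        // A = [[1,2],[0,3]], B = [[4,0],[5,6]]  ->  C = [[14,12],[15,18]]
        List<Entry> a = List.of(new Entry(0, 0, 1), new Entry(0, 1, 2), new Entry(1, 1, 3));
        List<Entry> b = List.of(new Entry(0, 0, 4), new Entry(1, 0, 5), new Entry(1, 1, 6));
        System.out.println(multiply(a, b).get(List.of(0, 0))); // 14.0
    }
}
```

In the real Hadoop job the grouping is done by the framework's shuffle, and the summation is a second map-reduce pass; only the per-key logic above has to be written by hand.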

7 Implementation Java, not Hadoop streaming. Why?
This seemed like a complex enough project to require more control: custom Key and Value classes, and a custom Partitioner class for the block method to distribute keys to reducers. Also an opportunity to learn Java.

8 Performance Random matrix generator: row dimension, column dimension, density. Doubles in (-1000, 1000). Many parameters to vary: matrix dimensions, maximum double value, number of splits, number of reducers, matrix density. Sparse 1000x1000, density 0.1, 6 splits, 12 reducers, 2.9 MB: 5 minutes. Sparse 5000x5000, density 0.1, 20 splits, 20 reducers, 73 MB: over 1 hour.
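A generator like the one described takes only a few lines. The class name, method signature, and exact "row,col,value" output format here are assumptions for illustration, not the project's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomCooGenerator {
    // Given row and column dimensions and a density in [0, 1], emit one
    // "row,col,value" coordinate-list line per kept entry, with values
    // drawn uniformly from (-1000, 1000).
    public static List<String> generate(int rows, int cols, double density, long seed) {
        Random rng = new Random(seed);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                if (rng.nextDouble() < density) {                     // keep ~density of cells
                    double value = (rng.nextDouble() * 2 - 1) * 1000; // in (-1000, 1000)
                    lines.add(i + "," + j + "," + value);
                }
        return lines;
    }

    public static void main(String[] args) {
        // A 1000x1000 matrix at density 0.1 keeps roughly 100,000 of its
        // 1,000,000 entries, matching the ~2.9 MB input file cited above.
        System.out.println(generate(1000, 1000, 0.1, 42L).size());
    }
}
```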

9 MATLAB Performance Windows 7, MATLAB 2015a 64-bit.
Engineering Library cluster, 4 GB RAM: 13,000x13,000 is about the largest that fits in memory. Full random matrices of doubles; multiplication time: about 2 minutes. LaFortune cluster, 16 GB RAM: 20,000x20,000 sparse random matrix; multiplication time: about 2 minutes 30 seconds.

10 Improvements? Different matrix representation? There may be better ways to represent sparse matrices than coordinate list format. Strassen’s algorithm? O(n^2.81), with benefits of only about 10% at matrix dimensions of a few thousand. Use a different algorithm? Use a different platform, such as Spark?
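For reference, Strassen's recursion replaces the 8 block products of the naive divide-and-conquer scheme with 7, which is where the O(n^2.81) bound comes from. A minimal sketch, assuming square matrices whose size is a power of two (real implementations pad the inputs and fall back to the naive method below some cutoff size):

```java
public class Strassen {
    static double[][] multiply(double[][] A, double[][] B) {
        int n = A.length;
        if (n == 1) return new double[][]{{A[0][0] * B[0][0]}};
        int h = n / 2;
        // Quadrants: A = [[a,b],[c,d]], B = [[e,f],[g,i]].
        double[][] a = sub(A,0,0,h), b = sub(A,0,h,h), c = sub(A,h,0,h), d = sub(A,h,h,h);
        double[][] e = sub(B,0,0,h), f = sub(B,0,h,h), g = sub(B,h,0,h), i = sub(B,h,h,h);
        // Seven recursive products instead of eight.
        double[][] m1 = multiply(add(a,d,1), add(e,i,1));
        double[][] m2 = multiply(add(c,d,1), e);
        double[][] m3 = multiply(a, add(f,i,-1));
        double[][] m4 = multiply(d, add(g,e,-1));
        double[][] m5 = multiply(add(a,b,1), i);
        double[][] m6 = multiply(add(c,a,-1), add(e,f,1));
        double[][] m7 = multiply(add(b,d,-1), add(g,i,1));
        double[][] C = new double[n][n];
        place(C, add(add(add(m1,m4,1), m5,-1), m7,1), 0, 0); // C11 = m1+m4-m5+m7
        place(C, add(m3,m5,1), 0, h);                        // C12 = m3+m5
        place(C, add(m2,m4,1), h, 0);                        // C21 = m2+m4
        place(C, add(add(add(m1,m2,-1), m3,1), m6,1), h, h); // C22 = m1-m2+m3+m6
        return C;
    }
    // Extract the h-by-h block of M with top-left corner (r, c).
    static double[][] sub(double[][] M, int r, int c, int h) {
        double[][] S = new double[h][h];
        for (int x = 0; x < h; x++)
            for (int y = 0; y < h; y++) S[x][y] = M[r + x][c + y];
        return S;
    }
    // Elementwise X + sign * Y.
    static double[][] add(double[][] X, double[][] Y, int sign) {
        int h = X.length;
        double[][] S = new double[h][h];
        for (int x = 0; x < h; x++)
            for (int y = 0; y < h; y++) S[x][y] = X[x][y] + sign * Y[x][y];
        return S;
    }
    // Copy block S into C with top-left corner (r, c).
    static void place(double[][] C, double[][] S, int r, int c) {
        for (int x = 0; x < S.length; x++)
            for (int y = 0; y < S.length; y++) C[r + x][c + y] = S[x][y];
    }
    public static void main(String[] args) {
        double[][] C = multiply(new double[][]{{1,2},{3,4}}, new double[][]{{5,6},{7,8}});
        System.out.println(C[0][0] + " " + C[1][1]); // 19.0 50.0
    }
}
```

The constant-factor overhead of all the block additions is why the benefit is so modest at dimensions of a few thousand, and the recursion pattern does not map naturally onto a flat map-reduce job.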

11 Conclusion What happened to the enormous matrices?
Based on this project, I do not think Hadoop is a practical choice for implementing matrix multiplication, and I did not find any Hadoop implementations of matrix multiplication that provide a significant benefit over a single local machine.

