Matrix Multiplication in Hadoop
Siddharth Saraph
Matrix Multiplication
- Matrices are ubiquitous in practice: the sciences, engineering, statistics, etc.
- Multiplication is a fundamental nontrivial matrix operation.
- It is simpler than an operation like matrix inversion, although the two have the same asymptotic time complexity.
The Problem
- Some applications need enormous matrices that cannot be handled on one machine.
- Idea: take advantage of map-reduce parallelism to approach this problem.
- Sense of scale: a 10,000 x 10,000 matrix has 100,000,000 entries; a 100,000 x 100,000 matrix has 10,000,000,000 entries.
- In practice, matrices at this scale are usually sparse.
First Step: Matrix Representation
- How should a matrix be represented as input to a map-reduce job?
- Convenient for sparse matrices: coordinate list (COO) format, one (row index, column index, value) triple per entry. (Worked on the board.)
- Entries with value 0 are omitted, and the triples may appear in any order.
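A minimal sketch of the coordinate-list idea in Java: each nonzero entry of a matrix becomes one "row,col,value" text record, the line-oriented form a Hadoop job can consume. The class and method names here are illustrative, not taken from the project's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Coordinate-list (COO) encoding of a sparse matrix: one text record
// per nonzero entry, in the form "row,col,value". Zero entries are
// dropped, and record order does not matter.
public class CooMatrix {
    public static List<String> toCoo(double[][] dense) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < dense.length; i++) {
            for (int j = 0; j < dense[i].length; j++) {
                if (dense[i][j] != 0.0) {   // omit entries with value 0
                    records.add(String.format(Locale.US, "%d,%d,%f", i, j, dense[i][j]));
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        double[][] m = { {1.0, 0.0}, {0.0, 2.5} };
        System.out.println(toCoo(m));   // only the two nonzero entries survive
    }
}
```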
Second Step: Map-Reduce Algorithm
- A simple "entrywise" method.
- Various related block methods: the matrices are partitioned into smaller blocks and logically processed block by block.
- The block methods involve an excess of notation and indices to keep track of; it is easy to get lost.
Second Step: Map-Reduce Algorithm
- (Algorithm worked through on the chalkboard.)
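Since the entrywise algorithm only appeared on the chalkboard, here is a single-JVM sketch of the standard one-job version for C = A * B, with A of size m x n and B of size n x p. The mapper for A[i][k] contributes to every output cell (i, j); the mapper for B[k][j] contributes to every (i, j); the reducer pairs records sharing the inner index k, multiplies, and sums. A HashMap stands in for Hadoop's shuffle; none of this is the Hadoop API itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local simulation of the entrywise map-reduce multiply. Dense arrays
// are used for brevity; the real input would be COO records.
public class Entrywise {
    // One shuffled record: which matrix it came from, the inner index k,
    // and the entry's value.
    record Rec(char tag, int k, double v) {}

    public static double[][] multiply(double[][] a, double[][] b) {
        int m = a.length, n = b.length, p = b[0].length;
        // "Shuffle": output coordinate (i, j), flattened to i * p + j,
        // mapped to all records destined for that cell.
        Map<Long, List<Rec>> shuffle = new HashMap<>();
        // Map over A: A[i][k] is needed by every C[i][j].
        for (int i = 0; i < m; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < p; j++)
                    shuffle.computeIfAbsent((long) i * p + j, x -> new ArrayList<>())
                           .add(new Rec('A', k, a[i][k]));
        // Map over B: B[k][j] is needed by every C[i][j].
        for (int k = 0; k < n; k++)
            for (int j = 0; j < p; j++)
                for (int i = 0; i < m; i++)
                    shuffle.computeIfAbsent((long) i * p + j, x -> new ArrayList<>())
                           .add(new Rec('B', k, b[k][j]));
        // Reduce: for each cell, pair A and B records on the inner index k.
        double[][] c = new double[m][p];
        for (Map.Entry<Long, List<Rec>> e : shuffle.entrySet()) {
            int i = (int) (e.getKey() / p), j = (int) (e.getKey() % p);
            double[] av = new double[n], bv = new double[n];
            for (Rec r : e.getValue())
                if (r.tag() == 'A') av[r.k()] = r.v(); else bv[r.k()] = r.v();
            for (int k = 0; k < n; k++) c[i][j] += av[k] * bv[k];
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] c = multiply(new double[][]{{1, 2}, {3, 4}},
                                new double[][]{{5, 6}, {7, 8}});
        System.out.println(c[0][0]);   // prints 19.0
    }
}
```

The cost is visible in the map stage: each input entry is replicated p (or m) times before the shuffle, which is one reason the entrywise method generates so much intermediate data.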
Implementation
- Java, not Hadoop streaming. Why? The project seemed complex enough to require finer control.
- Custom Key and Value classes.
- A custom Partitioner class for the block method, to distribute keys among reducers.
- A chance to learn Java.
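A plain-Java sketch of the two ideas behind the custom classes: a block coordinate (bi, bj) as the reduce key, and a getPartition-style function that spreads block keys across reducers. In real Hadoop the key would implement WritableComparable and the partitioner would extend org.apache.hadoop.mapreduce.Partitioner; this sketch strips out the Hadoop API and is not the project's actual code.

```java
// Block-index key with the equals/hashCode contract a partitioner
// relies on, plus a stand-in for Partitioner.getPartition.
public class BlockKey {
    final int bi, bj;   // block row and block column indices

    BlockKey(int bi, int bj) { this.bi = bi; this.bj = bj; }

    @Override public boolean equals(Object o) {
        return o instanceof BlockKey k && k.bi == bi && k.bj == bj;
    }

    @Override public int hashCode() { return 31 * bi + bj; }

    // Analogue of Partitioner.getPartition: assign a block key to one
    // of numReducers partitions, keeping the result non-negative.
    static int getPartition(BlockKey key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        System.out.println(getPartition(new BlockKey(3, 4), 12));   // prints 1
    }
}
```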
Performance
- Random matrix generator parameters: row dimension, column dimension, density; entries are doubles in (-1000, 1000).
- Many parameters to vary: matrix dimensions, maximum entry size, number of input splits, number of reducers, matrix density.
- Sparse 1000 x 1000, density 0.1, 6 splits, 12 reducers, 2.9 MB input: about 5 minutes.
- Sparse 5000 x 5000, density 0.1, 20 splits, 20 reducers, 73 MB input: over 1 hour.
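The generator described above can be sketched as follows: given the row dimension, column dimension, and target density, emit coordinate-list records with values drawn uniformly from (-1000, 1000). This is a hypothetical reconstruction from the slide's description, not the project's actual generator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Random sparse-matrix generator in COO form. Each potential entry is
// kept with probability `density`, so the expected number of records
// is density * rows * cols.
public class RandomCoo {
    public static List<String> generate(int rows, int cols, double density, long seed) {
        Random rng = new Random(seed);
        List<String> records = new ArrayList<>();
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                if (rng.nextDouble() < density) {
                    double v = (rng.nextDouble() * 2 - 1) * 1000;   // uniform in (-1000, 1000)
                    records.add(i + "," + j + "," + v);
                }
        return records;
    }

    public static void main(String[] args) {
        // Expect roughly 1000 records for a 100 x 100 matrix at density 0.1.
        System.out.println(generate(100, 100, 0.1, 42).size());
    }
}
```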
MATLAB Performance
- Windows 7, MATLAB 2015a 64-bit.
- Engineering Library cluster machine, 4 GB RAM: a full 13,000 x 13,000 random matrix of doubles is about the largest that fits in memory; multiplication time about 2 minutes.
- LaFortune cluster machine, 16 GB RAM: a 20,000 x 20,000 sparse matrix, density 0.1; multiplication time about 2 minutes 30 seconds.
Improvements?
- A different matrix representation? There may be better ways to represent sparse matrices than coordinate list format.
- Strassen's algorithm? It runs in O(n^2.807), but yields benefits of only about 10% at matrix dimensions of a few thousand.
- A different algorithm entirely?
- A different platform, such as Spark?
Conclusion
- What happened to the enormous matrices?
- Based on this project, Hadoop does not seem to be a practical choice for implementing matrix multiplication.
- I did not find any Hadoop implementation of matrix multiplication that provides a significant benefit over a single local machine.