The Gamma Operator for Big Data Summarization

Presentation on theme: "The Gamma Operator for Big Data Summarization"— Presentation transcript:

1 The Gamma Operator for Big Data Summarization
on an Array DBMS. Carlos Ordonez

2 Acknowledgments
Michael Stonebraker, MIT
My PhD students: Yiqun Zhang, Wellington Cabrera
SciDB team: Paul Brown, Bryan Lewis, Alex Polyakov

3 Why SciDB? Large matrices beyond RAM size
Storage by row or column is not good enough
Matrices are natural in statistics, engineering, and science
Multidimensional arrays -> matrices: not the same thing
Parallel shared-nothing is best for big data analytics
Closer to DBMS technology, but with some similarity to Hadoop
Feasible to create array operators that take matrices as input and produce a matrix as output
Combine processing with the R package and LAPACK

4

5 Old: separate sufficient statistics
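A sketch of the classical "separate" sufficient statistics this slide contrasts with (standard definitions, stated here for context rather than quoted from the slide): with X a d x n matrix holding one point x_i per column,

\[ n = |X|, \qquad L = \sum_{i=1}^{n} x_i, \qquad Q = X X^{T} = \sum_{i=1}^{n} x_i x_i^{T}, \]

each traditionally computed and stored as its own aggregate.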

6 New: Generalizing and unifying Sufficient Statistics: Z=[1,X,Y]
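A sketch of the unified summarization matrix built from Z = [1, X, Y] (a hedged reconstruction following the usual block layout for this operator, not quoted from the slide): with X a d x n matrix of points, Y the 1 x n output row, and z_i = (1, x_i^T, y_i)^T,

\[
\Gamma = Z Z^{T} = \sum_{i=1}^{n} z_i z_i^{T} =
\begin{bmatrix}
n & L^{T} & \sum_i y_i \\
L & Q & X Y^{T} \\
\sum_i y_i & Y X^{T} & Y Y^{T}
\end{bmatrix},
\qquad L = \sum_{i=1}^{n} x_i, \quad Q = X X^{T}.
\]

Every classical statistic appears as a block of this single (d+2) x (d+2) matrix.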

7 Equivalent equations with projections from Γ
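One example of such a projection (an illustration using the blocks defined above, not quoted from the slide): the least-squares coefficients of Y on X come directly from two sub-blocks of Γ,

\[ \hat{\beta} = Q^{-1} \, (X Y^{T}), \]

so once Γ has been computed in a single pass, fitting the model requires no further scan of X.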

8 Properties of Γ

9 Further properties (details): non-commutative and distributive

10 Storage in array chunks

11 In SciDB we store the points of X as a 2D array.
[Figure: each worker scans columns 1..d of its chunks of X]

12 Array storage and processing in SciDB
Assuming d << n, it is natural to hash-partition X by i = 1..n
Gamma computation is fully parallel, maintaining local Gamma versions in RAM (see the sketch below)
X can be read with a fully parallel scan
No need to write Gamma from RAM to disk during the scan, unless fault tolerance is required
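A minimal stand-alone sketch of this scheme (illustrative code, not the SciDB operator's source; the toy data, worker count, and function names are assumptions for the example): each worker accumulates a local Gamma over its own partition of the points, and the coordinator merges the local matrices by element-wise addition.

// Sketch: per-worker local Gamma = sum_i z_i z_i^T over that worker's points,
// merged by the coordinator. Illustrative only, not the SciDB operator code.
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<double>>;   // (d+2) x (d+2), dense

// Accumulate z z^T for one point, where z = [1, x_1..x_d, y].
static void updateGamma(Matrix& G, const std::vector<double>& x, double y) {
    std::vector<double> z;
    z.push_back(1.0);
    z.insert(z.end(), x.begin(), x.end());
    z.push_back(y);
    for (std::size_t a = 0; a < z.size(); ++a)
        for (std::size_t b = 0; b < z.size(); ++b)
            G[a][b] += z[a] * z[b];
}

int main() {
    const std::size_t d = 3, n = 8, workers = 2;
    std::vector<std::vector<double>> X(n, std::vector<double>(d));
    std::vector<double> Y(n);
    for (std::size_t i = 0; i < n; ++i) {           // toy data standing in for array chunks
        Y[i] = double(i);
        for (std::size_t j = 0; j < d; ++j) X[i][j] = double(i + j);
    }
    // One local Gamma per worker, kept in RAM (nothing written to disk during the scan).
    std::vector<Matrix> local(workers, Matrix(d + 2, std::vector<double>(d + 2, 0.0)));
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w)
        pool.emplace_back([&, w] {
            for (std::size_t i = w; i < n; i += workers)   // hash-style partition by i
                updateGamma(local[w], X[i], Y[i]);
        });
    for (auto& t : pool) t.join();
    // Coordinator: merge the local Gammas by addition.
    Matrix gamma(d + 2, std::vector<double>(d + 2, 0.0));
    for (const auto& G : local)
        for (std::size_t a = 0; a < d + 2; ++a)
            for (std::size_t b = 0; b < d + 2; ++b)
                gamma[a][b] += G[a][b];
    std::printf("n = %.0f, sum(y) = %.0f\n", gamma[0][0], gamma[0][d + 1]);
    return 0;
}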

13 Point must fit in one chunk. Otherwise, join is needed (slow)
[Figure: OK vs. NO chunk layouts on the coordinator and workers: each point's d values stored within a single chunk vs. split across chunks]

14 Parallel computation
[Figure: the array is partitioned across Worker 1 and Worker 2; each worker computes a local Gamma and sends it to the Coordinator]

15 Dense matrix operator: O(d² n)
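A hedged sketch of the dense per-point update behind the O(d² n) bound (illustrative code, not the operator's source): each point contributes a full z z^T outer product, and since Γ is symmetric only the lower triangle needs to be accumulated.

// Dense update: one call per point, O(d^2) work each, O(d^2 n) in total.
#include <cstddef>
#include <vector>

void denseUpdate(std::vector<double>& gamma,      // (d+2)*(d+2) entries, row-major
                 const std::vector<double>& z) {  // z = [1, x_1..x_d, y], dense
    const std::size_t m = z.size();
    for (std::size_t a = 0; a < m; ++a)
        for (std::size_t b = 0; b <= a; ++b)      // lower triangle only (Gamma is symmetric)
            gamma[a * m + b] += z[a] * z[b];
}
// The upper triangle can be mirrored once after the scan finishes.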

16 Sparse matrix operator: O(d n) for hyper-sparse matrix
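A hedged sketch of the corresponding sparse update (illustrative, not the operator's source): a point arrives as (index, value) pairs for its nonzero entries only, so a point with k nonzeros costs O(k²) work; for hyper-sparse data this keeps the total close to the O(d n) bound on the slide rather than O(d² n).

// Sparse update: only nonzero entries of z are visited, so zero entries add no work.
#include <cstddef>
#include <utility>
#include <vector>

void sparseUpdate(std::vector<double>& gamma, std::size_t m,   // m = d + 2
                  const std::vector<std::pair<std::size_t, double>>& z) {
    // z holds the nonzero (position, value) pairs of [1, x_i, y_i].
    for (const auto& [a, va] : z)
        for (const auto& [b, vb] : z)
            gamma[a * m + b] += va * vb;
}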

17 Pros: Algorithm evaluation with physical array operators
Since x_i fits in one chunk, joins are avoided (a hash or merge join costs at least 2X the I/O)
Since x_i x_i^T can be computed in RAM, we avoid an aggregation that would require sorting points by i
No need to store X twice (X and X^T): half the I/O, half the RAM space
No need to transpose X: a costly reorganization even in RAM, especially if X spans several RAM segments
The operator runs as compiled C++ code: fast; each vector accessed once; direct assignment (bypassing C++ function calls)

18 System issues and limitations
Gamma is not efficiently computable in AQL or AFL, hence an operator is required
Arrays of tuples in SciDB are more general but cumbersome for matrix manipulation: we use arrays with a single attribute (double)
Points must be stored completely inside a chunk, i.e. wide rectangular chunks, which may not be I/O optimal
Slow: arrays must be pre-processed into SciDB load format, loaded into a 1D array, and re-dimensioned => optimize the load
Multiple SciDB instances per node improve I/O speed by interleaving CPU
Larger chunks are better (8 MB), especially for dense matrices; avoid shuffling; avoid joins
Dense (alpha) and sparse (beta) versions

19 Benchmark: scale-up emphasis
Small: cluster with 2 Intel Quad-core servers, 4 GB RAM, 3 TB disk
Large: Amazon cloud 2

20

21 Why is Gamma faster than SciDB+LAPACK?
Gamma operator, times in seconds by phase:
d      Gamma op   Scan   mem alloc   CPU     merge
100    3.5        0.7    0.1         2.2     0.0
200    10.9       1.0                8.6
400    38.8                          33.9
800    145.0      4.6                134.7   0.4
1600   599.8      11.4               575.5
SciDB and LAPACK (crossprod() call in SciDB), broken down into transpose, subarray 1, repart 1, subarray 2, repart 2, build, gemm (ScaLAPACK), and MKL phases; reported totals are 77.3, 163.0, 373.1, and 1497.3 seconds, with most of the time spent in the subarray and repart reorganization steps rather than in gemm.

22 Combination: SciDB + R

23 Can Gamma operator beat LAPACK?
Gamma versus OpenBLAS LAPACK (90% of MKL performance)
Gamma: scan, sparse/dense, 2 threads; disk + RAM + CPU
LAPACK: OpenBLAS ~= MKL; 2 threads; RAM + CPU
[Table: times in seconds for the dense Gamma operator, the sparse Gamma operator, and OpenBLAS at d = 100, 200, 400, 800; rows vary n (100K, 1M) and density (0.1%, 1%, 10%, 100%); one large dense configuration is reported as a failure]

24 SciDB in the Cloud: massive parallelism

25 Conclusions One-pass summarization matrix operator: parallel, scalable
Optimizes the outer matrix multiplication as a sum (aggregation) of vector outer products
Dense and sparse matrix versions are required
Operator compatible with any parallel shared-nothing system, but works best on arrays
The Gamma matrix must fit in RAM, but n is unlimited
The summarization matrix can be exploited in many intermediate computations (with appropriate projections) in linear models; see the example below
Simplifies many methods to two phases: (1) summarization, (2) computing model parameters
Requires arrays, but can work with SQL or MapReduce
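One concrete instance of the two-phase pattern (a standard identity, stated here for illustration rather than quoted from the slide): after the summarization pass that produces Γ, the mean vector and covariance matrix follow from arithmetic on its small blocks,

\[ \bar{x} = \frac{L}{n}, \qquad V = \frac{Q}{n} - \frac{L L^{T}}{n^{2}}, \]

up to the usual n versus n-1 factor, so the model-parameter phase only touches a (d+2) x (d+2) matrix, never X again.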

26 Future work: Theory
Use Gamma in other models such as logistic regression, clustering, factor analysis, HMMs
Connection to frequent itemsets
Sampling
Higher expected moments, covariates
Unlikely: numeric stability with unnormalized sorted data

27 Future work: Systems
DONE: Sparse matrices: layout, compression
DONE: Beat LAPACK on high d
Online model learning (a cursor interface is needed, incompatible with a DBMS)
Unlimited d (currently d > 8000); is a join required for high d?
Parallel processing of high d is more complicated, chunked
Interface with BLAS and MKL: not worth it?
Faster than a column DBMS for sparse data?

