1 How to do Fast Analytics on Massive Datasets Alexander Gray Georgia Institute of Technology Computational Science and Engineering College of Computing FASTlab: Fundamental Algorithmic and Statistical Tools

2 The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
1. Arkadas Ozakin: Research scientist, PhD Theoretical Physics
2. Dong Ryeol Lee: PhD student, CS + Math
3. Ryan Riegel: PhD student, CS + Math
4. Parikshit Ram: PhD student, CS + Math
5. William March: PhD student, Math + CS
6. James Waters: PhD student, Physics + CS
7. Hua Ouyang: PhD student, CS
8. Sooraj Bhat: PhD student, CS
9. Ravi Sastry: PhD student, CS
10. Long Tran: PhD student, CS
11. Michael Holmes: PhD student, CS + Physics (co-supervised)
12. Nikolaos Vasiloglou: PhD student, EE (co-supervised)
13. Wei Guan: PhD student, CS (co-supervised)
14. Nishant Mehta: PhD student, CS (co-supervised)
15. Wee Chin Wong: PhD student, ChemE (co-supervised)
16. Abhimanyu Aditya: MS student, CS
17. Yatin Kanetkar: MS student, CS
18. Praveen Krishnaiah: MS student, CS
19. Devika Karnik: MS student, CS
20. Prasad Jakka: MS student, CS

3 Our mission
Allow users to apply all the state-of-the-art statistical methods…
…with orders-of-magnitude more computational efficiency
- Via: fast algorithms + distributed computing

4 The problem: big datasets
Any of these could be large: N (#data), D (#features), M (#models)

5 Core methods of statistics / machine learning / mining
- Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
- Density estimation: kernel density estimation O(N^2), mixture of Gaussians O(N)
- Regression: linear regression O(D^3), kernel regression O(N^2), Gaussian process regression O(N^3)
- Classification: nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine
- Dimension reduction: principal component analysis O(D^3), non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
- Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
- Clustering: k-means O(N), hierarchical clustering O(N^3), by dimension reduction
- Time series analysis: Kalman filter O(D^3), hidden Markov model, trajectory tracking
- 2-sample testing: n-point correlation O(N^n)
- Cross-match: bipartite matching O(N^3)
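To make the quadratic costs above concrete, here is a minimal sketch (not from the talk) of naive 1-D kernel density estimation with a Gaussian kernel: evaluating the density at all N data points requires summing a kernel over all N points per query, hence the O(N^2) entry on this slide.

```python
import math

def kde_naive(queries, data, h):
    """Naive Gaussian kernel density estimate at each query point.

    Cost is O(len(queries) * len(data)) kernel evaluations; with
    queries == data this is the O(N^2) bottleneck on the slide.
    """
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    out = []
    for q in queries:
        s = sum(math.exp(-((q - x) ** 2) / (2.0 * h * h)) for x in data)
        out.append(norm * s)
    return out
```

Tree-based methods accelerate exactly this kind of pairwise sum by approximating the contribution of far-away groups of points at once.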

6 Five main computational bottlenecks: aggregations, GNPs, graphical models, linear algebra, optimization
- Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
- Density estimation: kernel density estimation O(N^2), mixture of Gaussians O(N)
- Regression: linear regression O(D^3), kernel regression O(N^2), Gaussian process regression O(N^3)
- Classification: nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine
- Dimension reduction: principal component analysis O(D^3), non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
- Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
- Clustering: k-means O(N), hierarchical clustering O(N^3), by dimension reduction
- Time series analysis: Kalman filter O(D^3), hidden Markov model, trajectory tracking
- 2-sample testing: n-point correlation O(N^n)
- Cross-match: bipartite matching O(N^3)

7 Multi-resolution data structures, e.g. kd-trees [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
How can we compute this efficiently?
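The kd-tree levels pictured on the next slides come from recursively splitting the points along alternating coordinate axes. A minimal sketch (an illustration, not the labs' implementation) of median-split construction:

```python
def build_kdtree(points, depth=0):
    """Recursively split points along alternating axes at the median.

    Each recursion depth corresponds to one 'level' of the kd-tree
    shown on the following slides.
    """
    if len(points) <= 1:
        return {"points": points}            # leaf node
    axis = depth % len(points[0])            # cycle through dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "axis": axis,                        # splitting dimension
        "split": points[mid][axis],          # splitting coordinate
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid:], depth + 1),
    }
```

Each node implicitly bounds a region of space, which is what lets tree-traversal algorithms prune entire subtrees at once.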

8 A kd-tree: level 1

9 A kd-tree: level 2

10 A kd-tree: level 3

11 A kd-tree: level 4

12 A kd-tree: level 5

13 A kd-tree: level 6

14 Computational complexity using fast algorithms
- Querying: nearest-neighbor O(log N), spherical range-search O(log N), orthogonal range-search O(log N), contingency table
- Density estimation: kernel density estimation O(N) or O(1), mixture of Gaussians O(log N)
- Regression: linear regression O(D) or O(1), kernel regression O(N) or O(1), Gaussian process regression O(N) or O(1)
- Classification: nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N), support vector machine
- Dimension reduction: principal component analysis O(D) or O(1), non-negative matrix factorization, kernel PCA O(N) or O(1), maximum variance unfolding O(N)
- Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
- Clustering: k-means O(log N), hierarchical clustering O(N log N), by dimension reduction
- Time series analysis: Kalman filter O(D) or O(1), hidden Markov model, trajectory tracking
- 2-sample testing: n-point correlation O(N^log n)
- Cross-match: bipartite matching O(N) or O(1)

15 Ex: 3-point correlation runtime (biggest previous: 20K)
VIRGO simulation data, N = 75,000,000
naïve: 5×10^9 sec. (~150 years); multi-tree: 55 sec. (exact)
n=2: O(N); n=3: O(N^log 3); n=4: O(N^2)
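For intuition on the naïve baseline in the slide above, here is a minimal 2-point correlation count (all pairs within distance r) — this is the O(N^2) brute-force computation, not the multi-tree algorithm that achieved the 55-second runtime:

```python
def two_point_count(points, r):
    """Naive 2-point correlation: count pairs within distance r.

    Examines all N*(N-1)/2 pairs, i.e. O(N^2); the general n-point
    version is O(N^n), which is what multi-tree methods avoid.
    """
    r2 = r * r
    count = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            if d2 <= r2:
                count += 1
    return count
```

At N = 75,000,000 this pair loop alone is ~2.8×10^15 distance computations, which is why the naïve 3-point runtime on the slide is measured in years.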

16 Our upcoming products
- MLPACK (C++), Dec. 2008: first scalable comprehensive ML library
- THOR (distributed), Apr. 2009: faster tree-based “MapReduce” for analytics
- MLPACK-db, Apr. 2009: fast data analytics in relational databases (SQL Server)

17 The end
It’s possible to scale up all statistical methods…
…with smart algorithms, not just brute force
Look for MLPACK (C++), THOR (distributed), MLPACK-db (RDBMS)
Alexander Gray, agray@cc.gatech.edu (email is best; webpage sorely out of date)

18 Range-count recursive algorithm


22 Range-count recursive algorithm: Pruned! (inclusion)


30 Range-count recursive algorithm: Pruned! (exclusion)


33 Range-count recursive algorithm: fastest practical algorithm [Bentley 1975]; our algorithms can use any tree
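The inclusion and exclusion prunes illustrated on the preceding slides can be sketched as follows. This is a minimal stand-in (with a hypothetical node layout storing a bounding box and point count, not the talk's implementation): a subtree is counted wholesale when its box lies entirely inside the query ball, skipped when entirely outside, and only recursed into otherwise.

```python
def build(points, depth=0, leaf_size=2):
    """Build a tiny bounding-box tree: each node stores its bbox and count."""
    dims = range(len(points[0]))
    lo = tuple(min(p[i] for p in points) for i in dims)
    hi = tuple(max(p[i] for p in points) for i in dims)
    node = {"bbox": (lo, hi), "count": len(points)}
    if len(points) <= leaf_size:
        node["points"] = points
        return node
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    node["left"] = build(pts[:mid], depth + 1, leaf_size)
    node["right"] = build(pts[mid:], depth + 1, leaf_size)
    return node

def range_count(node, center, radius):
    """Count points within `radius` of `center`, pruning whole subtrees."""
    lo, hi = node["bbox"]
    dmin = dmax = 0.0
    for c, l, h in zip(center, lo, hi):
        dmin += max(l - c, 0.0, c - h) ** 2      # nearest box corner
        dmax += max(abs(c - l), abs(c - h)) ** 2  # farthest box corner
    if dmin > radius ** 2:        # exclusion prune: box misses the ball
        return 0
    if dmax <= radius ** 2:       # inclusion prune: box inside the ball
        return node["count"]
    if "points" in node:          # leaf: test points individually
        return sum(
            sum((a - b) ** 2 for a, b in zip(p, center)) <= radius ** 2
            for p in node["points"]
        )
    return (range_count(node["left"], center, radius)
            + range_count(node["right"], center, radius))
```

Both prunes replace a whole subtree's worth of distance tests with a single box-vs-ball comparison, which is the source of the speedups on slide 14.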

