Data Mining Runtime Software and Algorithms. BigDat 2015: International Winter School on Big Data, Tarragona, Spain, January 26-30, 2015 (lecture of January 26, 2015).


1 Data Mining Runtime Software and Algorithms
BigDat 2015: International Winter School on Big Data, Tarragona, Spain, January 26-30, 2015
Geoffrey Fox, gcf@indiana.edu, http://www.infomall.org
School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

2 Parallel Data Analytics
Streaming algorithms have interesting differences, but "batch" data analytics is "just parallel computing" with the usual features such as SPMD and BSP
Static regular problems are straightforward, but dynamic irregular problems are technically hard and high-level approaches fail (see High Performance Fortran HPF)
– Regular meshes worked well
– Adaptive dynamic meshes did not, although "real people with MPI" could parallelize them
Using libraries is successful at either
– Lowest: communication level
– Higher: "core analytics" level
Data analytics does not yet have "good regular parallel libraries"

3 Iterative MapReduce Implementing HPC-ABDS
Judy Qiu, Bingjing Zhang, Dennis Gannon, Thilina Gunarathne

4 Why worry about Iteration?
Key analytics fit MapReduce and do NOT need improvements, in particular iteration. These are
– Search (as in Bing, Yahoo, Google)
– Recommender engines as in e-commerce (Amazon, Netflix)
– Alignment as in BLAST for bioinformatics
However, most data mining, such as deep learning, clustering, and support vector machines, requires iteration and cannot be done in a single MapReduce step
– Communicating between steps via disk, as done in Hadoop implementations, is far too slow
– So cache data (both input data and results of collective computation) between iterations
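To make the caching point concrete, the sketch below (a minimal illustration in plain Java, not the Twister4Azure or Harp implementation) shows the shape of an iterative K-means driver: the large point set is loaded once and reused every iteration, while only the small centroid array changes between rounds, which is exactly the state an iterative MapReduce runtime caches rather than re-reading from disk.

```java
// Minimal illustration (not the Twister4Azure/Harp API): the large, static point
// set is loaded once and cached; only the small centroid array changes per
// iteration, so it is the only thing worth communicating between steps.
public class IterativeKMeansSketch {
    public static double[][] run(double[][] points, double[][] centroids, int iterations) {
        int k = centroids.length, d = centroids[0].length;
        for (int iter = 0; iter < iterations; iter++) {
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (double[] p : points) {                 // cached input, reused every iteration
                int best = 0; double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int j = 0; j < d; j++) { double diff = p[j] - centroids[c][j]; dist += diff * diff; }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                counts[best]++;
                for (int j = 0; j < d; j++) sums[best][j] += p[j];
            }
            for (int c = 0; c < k; c++)                  // new centroids: the only inter-iteration state
                if (counts[c] > 0)
                    for (int j = 0; j < d; j++) centroids[c][j] = sums[c][j] / counts[c];
        }
        return centroids;
    }
}
```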

5 Using Optimal "Collective" Operations
Twister4Azure Iterative MapReduce with enhanced collectives
– Map-AllReduce primitive and MapReduce-MergeBroadcast
Test on Hadoop (Linux) for strong and weak scaling on K-means for up to 256 cores
Hadoop vs H-Collectives Map-AllReduce: 500 centroids (clusters), 20 dimensions, 10 iterations
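The sketch below illustrates what a Map-AllReduce step computes for K-means (it is not the Twister4Azure or H-Collectives API): each map task contributes partial per-centroid sums and counts, and an allreduce combines them so every task ends up holding the new centroids without a separate reduce-plus-broadcast phase. A real runtime would perform the summation with a tree or ring over the network rather than the serial loop shown here.

```java
// Sketch of the combine step in a Map-AllReduce K-means iteration (illustrative,
// not the Twister4Azure/H-Collectives API). Each map task produces partial
// per-centroid sums and counts; element-wise addition followed by division
// yields the new centroids, which every task then holds locally.
public class MapAllReduceSketch {
    // partials[task][centroid][dim] holds one map task's local sums;
    // counts[task][centroid] holds its local point counts.
    public static double[][] allReduceCentroids(double[][][] partials, int[][] counts) {
        int k = partials[0].length, d = partials[0][0].length;
        double[][] sums = new double[k][d];
        int[] total = new int[k];
        for (int t = 0; t < partials.length; t++) {       // "reduce": element-wise sum over tasks
            for (int c = 0; c < k; c++) {
                total[c] += counts[t][c];
                for (int j = 0; j < d; j++) sums[c][j] += partials[t][c][j];
            }
        }
        for (int c = 0; c < k; c++)                        // finish the mean; the result reaches all tasks
            if (total[c] > 0)
                for (int j = 0; j < d; j++) sums[c][j] /= total[c];
        return sums;
    }
}
```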

6 Kmeans and (Iterative) MapReduce
Shaded areas are compute only, where Hadoop on an HPC cluster is fastest
Areas above the shading are overheads, where Twister4Azure (T4A) is smallest and T4A with the AllReduce collective has the lowest overhead
Note that even on Azure, Java (orange) is faster than the T4A C# for compute

7 Harp Design
[Figure: parallelism model and architecture. The MapReduce model has mappers feeding a shuffle into reducers; the Map-Collective or Map-Communication model has mappers exchanging data directly through optimized communication. Architecture stack: YARN (resource manager), MapReduce V2 with the Harp plugin (framework), supporting both MapReduce applications and Map-Collective or Map-Communication applications.]

8 Features of Harp Hadoop Plugin
Hadoop plugin (works with Hadoop 1.2.1 and Hadoop 2.2.0)
Hierarchical data abstraction on arrays, key-values and graphs for easy programming expressiveness
Collective communication model to support various communication operations on the data abstractions (will extend to point to point)
Caching with buffer management for the memory allocation required by computation and communication
BSP-style parallelism
Fault tolerance with checkpointing
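The skeleton below is a hypothetical illustration of the map-collective programming model described above; it is NOT Harp's actual API, and the names Collective, allReduceSum, MapCollectiveTask, and computeLocal are invented for the sketch. The shape is the point: each worker computes on its cached partition, then a collective synchronizes the shared model before the next BSP superstep.

```java
// Hypothetical map-collective skeleton (NOT the Harp API): a BSP-style loop in
// which each worker computes on its local partition and a collective call
// synchronizes the shared model before the next superstep.
interface Collective {
    double[] allReduceSum(double[] localContribution);   // assumed collective primitive
}

abstract class MapCollectiveTask {
    protected double[] model;          // globally shared model (e.g. flattened centroids)
    protected double[][] localData;    // cached partition, loaded once

    // one superstep: local computation followed by a collective
    void superstep(Collective comm) {
        double[] contribution = computeLocal(localData, model);
        model = comm.allReduceSum(contribution);   // every worker now holds the combined model
    }

    abstract double[] computeLocal(double[][] data, double[] model);
}
```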

9 WDA SMACOF MDS (Multidimensional Scaling) using Harp on IU Big Red 2
[Figure: parallel efficiency on 100K-300K sequences vs. number of nodes (cores = 32 x #nodes); the dominant time is conjugate gradient plus matrix multiplication]
Best available MDS (much better than that in R); Java Harp (Hadoop plugin)

10 Increasing Communication, Identical Computation
– Mahout and Hadoop MR: slow due to MapReduce
– Python: slow as scripting; MPI fastest
– Spark: iterative MapReduce, non-optimal communication
– Harp: Hadoop plugin with ~MPI collectives

11 Parallel Tweet Clustering with Storm
Judy Qiu and Xiaoming Gao
Storm bolts coordinated by ActiveMQ to synchronize parallel cluster-center updates – this adds loops to Storm
2 million streaming tweets processed in 40 minutes; 35,000 clusters
Moving from sequential to parallel – eventually 10,000 bolts
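As a flavor of the per-tuple work such a topology does, here is an illustrative clustering bolt; it is not the code from this work, the distance threshold is an invented placeholder, package names and the prepare signature vary slightly across Storm versions (backtype.storm vs org.apache.storm), and the cross-bolt synchronization of centers (done with ActiveMQ above) is elided.

```java
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative clustering bolt (not the talk's code): each bolt instance assigns an
// incoming tweet feature vector to its nearest local cluster center and updates that
// center incrementally; synchronization of centers across bolts is omitted here.
public class TweetClusterBolt extends BaseRichBolt {
    private OutputCollector collector;
    private List<double[]> centers;
    private List<Integer> counts;
    private static final double NEW_CLUSTER_THRESHOLD = 0.5;  // assumed distance cutoff

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.centers = new ArrayList<>();
        this.counts = new ArrayList<>();
    }

    @Override
    public void execute(Tuple tuple) {
        double[] v = (double[]) tuple.getValueByField("features");
        int best = -1; double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.size(); c++) {
            double dist = 0;
            double[] center = centers.get(c);
            for (int j = 0; j < v.length; j++) { double diff = v[j] - center[j]; dist += diff * diff; }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        if (best < 0 || bestDist > NEW_CLUSTER_THRESHOLD) {     // start a new cluster
            centers.add(v.clone());
            counts.add(1);
            best = centers.size() - 1;
        } else {                                                 // move the center toward the new point
            double[] center = centers.get(best);
            int n = counts.get(best) + 1;
            for (int j = 0; j < v.length; j++) center[j] += (v[j] - center[j]) / n;
            counts.set(best, n);
        }
        collector.emit(new Values(tuple.getStringByField("tweetId"), best));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("tweetId", "clusterId"));
    }
}
```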

12 Parallel Tweet Clustering with Storm
Speedup on up to 96 bolts on two clusters, Moe and Madrid
Red curve is the old algorithm; green and blue are the new algorithm
Full Twitter – 1,000-way parallelism; full everything – 10,000-way parallelism

13 Data Analytics in SPIDAL

14 Analytics and the DIKW Pipeline
Data goes through a pipeline: Raw data → Data → Information → Knowledge → Wisdom → Decisions
Each link is enabled by a filter, which is "business logic" or "analytics"
We are interested in filters that involve "sophisticated analytics", which require non-trivial parallel algorithms
– Improve the state of the art in both algorithm quality and (parallel) performance
Design and build SPIDAL (Scalable Parallel Interoperable Data Analytics Library)
[Figure: Data → (Analytics) → Information → (More Analytics) → Knowledge]
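A toy sketch of the pipeline-of-filters idea (an illustration only, not SPIDAL code; the stage names and the alert rule are invented): each link in the DIKW chain is a function, and a "sophisticated analytic" is simply a stage whose internals require non-trivial parallel algorithms.

```java
import java.util.List;
import java.util.function.Function;

// Toy sketch of the DIKW pipeline idea (illustration only, not SPIDAL code): each
// link is a "filter" function composed with the next.
public class DikwPipelineSketch {
    public static void main(String[] args) {
        Function<List<Double>, double[]> clean =                // raw data -> data
                raw -> raw.stream().mapToDouble(Double::doubleValue).toArray();
        Function<double[], Double> summarize =                   // data -> information (a trivial "analytic")
                data -> java.util.Arrays.stream(data).average().orElse(Double.NaN);
        Function<Double, String> decide =                         // information -> knowledge / decision
                mean -> mean > 0.5 ? "alert" : "ok";

        Function<List<Double>, String> pipeline = clean.andThen(summarize).andThen(decide);
        System.out.println(pipeline.apply(List.of(0.2, 0.9, 0.7)));   // prints "alert"
    }
}
```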

15 Strategy to Build SPIDAL
Analyze Big Data applications to identify the analytics needed and generate benchmark applications
Analyze existing analytics libraries (in practice limited to some application domains) – catalog the library members available and their performance
– Mahout has low performance, R is largely sequential and missing key algorithms, MLlib is just starting
Identify big data computer architectures
Identify a software model that allows interoperability and performance
Design or identify new or existing algorithms, including parallel implementations
Collaborate with application scientists and the computer systems and statistics/algorithms communities

16 Machine Learning in Network Science, Imaging in Computer Vision, Pathology, Polar Science, Biomolecular Simulations

Algorithm | Applications | Features | Status | Parallelism
Graph Analytics
Community detection | Social networks, webgraph | Graph | P-DM | GML-GrC
Subgraph/motif finding | Webgraph, biological/social networks | Graph | P-DM | GML-GrB
Finding diameter | Social networks, webgraph | Graph | P-DM | GML-GrB
Clustering coefficient | Social networks | Graph | P-DM | GML-GrC
Page rank | Webgraph | Graph | P-DM | GML-GrC
Maximal cliques | Social networks, webgraph | Graph | P-DM | GML-GrB
Connected component | Social networks, webgraph | Graph | P-DM | GML-GrB
Betweenness centrality | Social networks | Graph, non-metric, static | P-Shm | GML-GRA
Shortest path | Social networks, webgraph | Graph, non-metric, static | P-Shm |
Spatial Queries and Analytics
Spatial relationship based queries | GIS/social networks/pathology informatics | Geometric | P-DM | PP
Distance based queries | GIS/social networks/pathology informatics | Geometric | P-DM | PP
Spatial clustering | GIS/social networks/pathology informatics | Geometric | Seq | GML
Spatial modeling | GIS/social networks/pathology informatics | Geometric | Seq | PP

Legend: GML = Global (parallel) ML; GrA = Static; GrB = Runtime partitioning

17 Some specialized data analytics in SPIDAL

Algorithm | Applications | Features | Status | Parallelism
Core Image Processing
Image preprocessing | Computer vision/pathology informatics | Metric space point sets, neighborhood sets & image features | P-DM | PP
Object detection & segmentation | Computer vision/pathology informatics | Metric space point sets, neighborhood sets & image features | P-DM | PP
Image/object feature computation | Computer vision/pathology informatics | Metric space point sets, neighborhood sets & image features | P-DM | PP
3D image registration | Computer vision/pathology informatics | Metric space point sets, neighborhood sets & image features | Seq | PP
Object matching | Computer vision/pathology informatics | Geometric | Todo | PP
3D feature extraction | Computer vision/pathology informatics | Geometric | Todo | PP
Deep Learning
Learning network, stochastic gradient descent | Image understanding, language translation, voice recognition, car driving | Connections in artificial neural net | P-DM | GML

Legend: PP = Pleasingly Parallel (local ML); Seq = Sequential available; GRA = Good distributed algorithm needed; Todo = No prototype available; P-DM = Distributed memory available; P-Shm = Shared memory available

18 Some Core Machine Learning Building Blocks

Algorithm | Applications | Features | Status | Parallelism
DA Vector Clustering | Accurate clusters | Vectors | P-DM | GML
DA Non-metric Clustering | Accurate clusters, biology, web | Non-metric, O(N²) | P-DM | GML
K-means: basic, fuzzy and Elkan | Fast clustering | Vectors | P-DM | GML
Levenberg-Marquardt Optimization | Non-linear Gauss-Newton, used in MDS | Least squares | P-DM | GML
SMACOF Dimension Reduction | DA-MDS with general weights | Least squares, O(N²) | P-DM | GML
Vector Dimension Reduction | DA-GTM and others | Vectors | P-DM | GML
TFIDF Search | Find nearest neighbors in document corpus | Bag of "words" (image features) | P-DM | PP
All-pairs similarity search | Find pairs of documents with TFIDF distance below a threshold | | Todo | GML
Support Vector Machine SVM | Learn and classify | Vectors | Seq | GML
Random Forest | Learn and classify | Vectors | P-DM | PP
Gibbs sampling (MCMC) | Solve global inference problems | Graph | Todo | GML
Latent Dirichlet Allocation LDA (Gibbs sampling or variational Bayes) | Topic models (latent factors) | Bag of "words" | P-DM | GML
Singular Value Decomposition SVD | Dimension reduction and PCA | Vectors | Seq | GML
Hidden Markov Models (HMM) | Global inference on sequence models | Vectors | Seq | PP & GML

19 Parallel Data Mining

20 Remarks on Parallelism I
Most algorithms use parallelism over the items in the data set
– e.g. the entities to cluster or to map to Euclidean space
The exception is deep learning (for image data sets), which has parallelism over the pixel plane in the neurons and not over the items in the training set
– as Stochastic Gradient Descent (SGD) only looks at small numbers of data items at a time
– Need experiments to really test SGD – there are no easy-to-use parallel implementations, and tests at scale have NOT been done
– Maybe it got where it is because most work is sequential

21 Remarks on Parallelism II
Maximum likelihood or χ² both lead to objective functions with the structure: minimize Σ_{i=1..N} (positive nonlinear function of the unknown parameters for item i)
All are solved iteratively with a (clever) first- or second-order approximation to the shift in the objective function
– Sometimes the steepest-descent direction; sometimes Newton
– With 11 billion deep-learning parameters, Newton is impossible
– These have the classic Expectation Maximization structure
– The steepest-descent shift is a sum over the shifts calculated from each point
SGD – take a few hundred randomly chosen items from the data set, calculate the shifts over these, and move a tiny distance
Classic method – take all (millions of) items in the data set and move the full distance
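A minimal sketch of the two update styles just described, for an objective of the form Σ_i f_i(params); the per-item gradient is a placeholder interface supplied by the reader, and the step sizes are illustrative, not from the talk.

```java
import java.util.Random;

// Minimal sketch of the two update rules described above. gradOfItem(i, params)
// stands for the per-item shift and is a placeholder the reader supplies.
public class DescentSketch {
    interface ItemGradient { double[] gradOfItem(int item, double[] params); }

    // Classic full-batch step: sum the shift over all N items, move the full distance.
    static void fullBatchStep(double[] params, int n, ItemGradient g, double step) {
        double[] shift = new double[params.length];
        for (int i = 0; i < n; i++) {
            double[] gi = g.gradOfItem(i, params);
            for (int j = 0; j < params.length; j++) shift[j] += gi[j];
        }
        for (int j = 0; j < params.length; j++) params[j] -= step * shift[j];
    }

    // SGD step: a few hundred randomly chosen items, a tiny step.
    static void sgdStep(double[] params, int n, ItemGradient g, double tinyStep,
                        int batchSize, Random rng) {
        double[] shift = new double[params.length];
        for (int b = 0; b < batchSize; b++) {
            double[] gi = g.gradOfItem(rng.nextInt(n), params);
            for (int j = 0; j < params.length; j++) shift[j] += gi[j];
        }
        for (int j = 0; j < params.length; j++) params[j] -= tinyStep * shift[j] / batchSize;
    }
}
```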

22 Remarks on Parallelism III
Need to cover non-vector semimetric spaces and vector spaces for clustering and dimension reduction (N points in a space)
MDS minimizes Stress σ(X) = Σ_{i<j≤N} weight(i,j) (δ(i,j) − d(X_i, X_j))²
Semimetric spaces just have pairwise distances δ(i,j) defined between points in the space
Vector spaces have Euclidean distances and scalar products
– Algorithms can be O(N) and these are best for clustering, but for MDS O(N) methods may not be best, as the obvious objective function is O(N²)
– Important new algorithms are needed to define O(N) versions of the current O(N²) ones – they "must" work intuitively and be shown to work in principle
Note the matrix solvers all use conjugate gradient, which converges in 5-100 iterations – a big gain for a matrix with a million rows; this removes a factor of N in time complexity
The ratio of #clusters to #points is important; new ideas are needed if the ratio >~ 0.1
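A direct transcription of the stress formula above into code (an illustration, not the SPIDAL or SMACOF implementation); the double loop over pairs makes the O(N²) cost mentioned on the slide explicit.

```java
// Minimal illustration (not the SPIDAL/SMACOF implementation) of the stress
// function: sigma(X) = sum_{i<j} weight(i,j) * (delta(i,j) - d(X_i, X_j))^2.
public class StressSketch {
    // delta[i][j]: observed dissimilarities; weight[i][j]: weights; x[i]: embedded coordinates
    public static double stress(double[][] delta, double[][] weight, double[][] x) {
        double sigma = 0.0;
        int n = x.length;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {          // O(N^2) pairs: the cost the slide refers to
                double d = 0.0;
                for (int k = 0; k < x[i].length; k++) {
                    double diff = x[i][k] - x[j][k];
                    d += diff * diff;
                }
                double residual = delta[i][j] - Math.sqrt(d);
                sigma += weight[i][j] * residual * residual;
            }
        }
        return sigma;
    }
}
```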

23 Structure of Parameters
Note that learning networks have a huge number of parameters (11 billion in the Stanford work), so it is inconceivable to look at the second derivative
Clustering and MDS have lots of parameters, but it can be practical to look at the second derivative and use Newton's method to minimize
Parameters are determined in a distributed fashion but are typically needed globally
– MPI uses broadcast and "AllCollectives"
– The AI community uses a parameter server and accesses parameters as needed

24 Robustness from Deterministic Annealing
Deterministic annealing smears the objective function and avoids local minima, and is much faster than simulated annealing
Clustering
– Vectors: Rose (Gurewitz and Fox) 1990
– Clusters with fixed sizes and no tails (Proteomics team at Broad)
– No vectors: Hofmann and Buhmann (just use pairwise distances)
Dimension reduction for visualization and analysis
– Vectors: GTM (Generative Topographic Mapping)
– No vectors: SMACOF MDS (Multidimensional Scaling) (just use pairwise distances)
Can apply to HMM and general mixture models (less studied)
– Gaussian Mixture Models
– Probabilistic Latent Semantic Analysis with Deterministic Annealing (DA-PLSA) as an alternative to Latent Dirichlet Allocation for finding "hidden factors"

25 More Efficient Parallelism
The canonical model is correct at the start, but each point does not really contribute to each cluster, as contributions are damped exponentially by exp(−(X_i − Y(k))²/T)
For the proteomics problem, on average only 6.45 clusters are needed per point if we require (X_i − Y(k))²/T ≤ ~40 (as exp(−40) is small)
So we only need to keep the nearby clusters for each point
As the average number of clusters is ~20,000, this gives a factor of ~3,000 improvement
Further, communication is no longer all global; it has nearest-neighbor components and is calculated with parallelism over clusters
Claim: ~all O(N²) machine learning algorithms can be done in O(N log N) using ideas as in fast multipole / Barnes-Hut for particle dynamics – ~0 use in practice so far
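A minimal sketch of the pruning idea above (assumed parameter names, not SPIDAL code): a point's responsibility for cluster k is proportional to exp(−(X_i − Y(k))²/T), so any cluster with (X_i − Y(k))²/T above a cutoff around 40 contributes essentially nothing and can be skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (assumed names, not SPIDAL code) of exponential-damping pruning:
// clusters with (x - y_k)^2 / T above the cutoff are dropped for this point.
public class DaPruningSketch {
    static final double CUTOFF = 40.0;   // exp(-40) is negligibly small

    // Returns indices of the clusters a point actually needs to interact with.
    static List<Integer> nearbyClusters(double[] x, double[][] centers, double temperature) {
        List<Integer> keep = new ArrayList<>();
        for (int k = 0; k < centers.length; k++) {
            double dist2 = 0.0;
            for (int j = 0; j < x.length; j++) {
                double diff = x[j] - centers[k][j];
                dist2 += diff * diff;
            }
            if (dist2 / temperature <= CUTOFF) keep.add(k);   // keep only nearby clusters
        }
        return keep;
    }
}
```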

26 SPIDAL EXAMPLES

27 Fragment of 30,000 Clusters, 241,605 Points
The brownish triangles are stray peaks outside any cluster. The colored hexagons are peaks inside clusters, with the white hexagons being the determined cluster centers

28 "Divergent" Data Sample, 23 True Sequences
DA-PWC vs UClust on the divergent data set (UClust cuts 0.65 to 0.95):

Measure | DA-PWC | UClust 0.65 | UClust 0.75 | UClust 0.85 | UClust 0.95
Total # of clusters | 23 | 4 | 10 | 36 | 91
Total # of clusters uniquely identified (one original cluster goes to one uclust cluster) | 23 | 0 | 0 | 13 | 16
Total # of shared clusters with significant sharing (one uclust cluster goes to >1 real cluster) | 0 | 4 | 10 | 5 | 0
Total # of uclust clusters that are just part of a real cluster (numbers in brackets have only one member) | 0 | 4 | 10 | 17 (11) | 72 (62)
Total # of real clusters that are one uclust cluster, but that uclust cluster is spread over multiple real clusters | 0 | 14 | 9 | 5 | 0
Total # of real clusters that have significant contribution from >1 uclust cluster | 0 | 9 | 14 | 5 | 7

29 Protein Universe Browser for COG Sequences with a few illustrative biologically identified clusters

30 Heatmap of biology distance (Needleman-Wunsch) vs 3D Euclidean distances
If d is a distance, so is f(d) for any monotonic f. Optimize the choice of f

31 446K sequences ~100 clusters

32 MDS applied to the classifying cluster centers and existing Fungi sequences gives nice 3D phylogenetic trees

33 "Clean" sample of 446K sequences
The O(N²) interactions between the green and purple clusters should be representable by centroids, as in Barnes-Hut. This is hard as there is no Gauss theorem and no multipole expansion, and the points are really in a 1000-dimensional space, as they were clustered before the 3D projection
The O(N²) green-green and purple-purple interactions have value, but the green-purple ones are "wasted"

34 Use the Barnes-Hut OctTree, originally developed to make O(N²) astrophysics O(N log N), to give similar speedups in machine learning
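The key ingredient of Barnes-Hut is the acceptance test that lets a whole tree cell stand in for its members; the sketch below illustrates that test only (it is not the code used in this work, and the opening-angle value is an assumed typical default), which is what converts O(N²) pairwise sums into O(N log N).

```java
// Illustrative Barnes-Hut acceptance test (a sketch, not the talk's code): a whole
// octree cell of size `cellSize` with centroid `centroid` can stand in for its
// members when it is far enough from the query point.
public class BarnesHutSketch {
    static final double THETA = 0.5;   // assumed opening-angle parameter

    static boolean cellIsFarEnough(double[] query, double[] centroid, double cellSize) {
        double dist2 = 0.0;
        for (int j = 0; j < query.length; j++) {
            double diff = query[j] - centroid[j];
            dist2 += diff * diff;
        }
        // standard Barnes-Hut criterion: cellSize / distance < theta
        return cellSize * cellSize < THETA * THETA * dist2;
    }
}
```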

35 OctTree for a 100K sample of Fungi
We use the OctTree for logarithmic interpolation (streaming data)

36 Algorithm Challenges
See the NRC Massive Data Analysis report
O(N) algorithms for O(N²) problems
Parallelizing Stochastic Gradient Descent
Streaming data algorithms – balance and interplay between batch methods (most time consuming) and interpolative streaming methods
Graph algorithms – do they need shared memory?
The machine learning community uses parameter servers; the parallel computing (MPI) community would not recommend this
– Is the classic distributed model for a "parameter service" better?
Apply the best of parallel computing – communication and load balancing – to Giraph/Hadoop/Spark
Are data analytics sparse? Many cases are full matrices
BTW, need Java Grande
– Some C++, but Java is most popular in ABDS, with Python, Erlang, Go, Scala (compiles to the JVM)...

37 Some Futures
Always run MDS; it gives insight into data
– Leads to a data browser, as GIS gives for spatial data
The claim is that algorithm changes gave as much performance increase as hardware changes in simulations. Will this happen in analytics?
– Today is like parallel computing 30 years ago with regular meshes. We will learn how to adapt methods automatically to give "multigrid" and "fast multipole"-like algorithms
Need to start developing the libraries that support Big Data
– Understand architecture issues
– Have coupled batch and streaming versions
– Develop much better algorithms
Please join the SPIDAL (Scalable Parallel Interoperable Data Analytics Library) community

38 Java Grande

39 We once tried to encourage the use of Java in HPC with the Java Grande Forum, but Fortran, C and C++ remain the central HPC languages
– Not helped by the .com and Sun collapse in 2000-2005
The pure-Java CartaBlanca, a 2005 R&D100 award-winning project, was an early successful example of HPC use of Java in a simulation tool for non-linear physics on unstructured grids
Of course Java is a major language in ABDS, and as data analysis and simulation are naturally linked, we should consider broader use of Java
Using Habanero Java (from Rice University) for threads and mpiJava or FastMPJ for MPI, we are gathering a collection of high-performance parallel Java analytics
– Converted from C#; the sequential Java is faster than the sequential C#
So we will have either Hadoop+Harp or classic Threads/MPI versions in a Java Grande version of Mahout
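As a flavor of the intra-node threading half of that Threads/MPI combination, here is a minimal sketch using only the standard java.util.concurrent library (deliberately not Habanero Java or mpiJava, whose APIs are not shown in the talk): each thread computes a partial sum over its slice of the data and the partials are combined at the join; in the hybrid setting that combined value would then feed an inter-node MPI collective.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal sketch of an intra-node threaded kernel (plain java.util.concurrent,
// not Habanero Java): each thread sums its slice, and partial results are
// combined after the fork-join.
public class ThreadedSumSketch {
    public static double parallelSum(double[] data, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Double>> partials = new ArrayList<>();
        int chunk = (data.length + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int start = t * chunk;
            final int end = Math.min(data.length, start + chunk);
            partials.add(pool.submit((Callable<Double>) () -> {
                double s = 0.0;
                for (int i = start; i < end; i++) s += data[i];
                return s;
            }));
        }
        double total = 0.0;
        for (Future<Double> f : partials) total += f.get();   // join: combine partial sums
        pool.shutdown();
        return total;
    }
}
```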

40 Performance of MPI Kernel Operations
Pure Java, as in FastMPJ, is slower than Java interfacing to the C version of MPI

41 Java Grande and C# on 40K point DAPWC Clustering
Very sensitive to threads vs MPI
[Figure: total time on TXP nodes at 64-, 128- and 256-way parallelism for C# and Java; the C# hardware has roughly 0.7 the performance of the Java hardware]

42 Java and C# on 12.6K point DAPWC Clustering
[Figure: time in hours vs #threads x #processes per node (1x1, 2x2, 1x2, 1x4, 2x1, 1x8, 4x1, 2x4, 4x2, 8x1) and #nodes / total parallelism, for Java and C#; the C# hardware has roughly 0.7 the performance of the Java hardware]

