Spiros Papadimitriou Jimeng Sun IBM T.J. Watson Research Center Hawthorne, NY, USA Reporter: Nai-Hui, Ku.


1 Spiros Papadimitriou Jimeng Sun IBM T.J. Watson Research Center Hawthorne, NY, USA Reporter: Nai-Hui, Ku

2
- Introduction
- Related Work
- Distributed Mining Process
- Co-clustering Huge Datasets
- Experiments
- Conclusions

3
- Problems
  - Huge datasets
  - Natural sources of data are in impure form
- Proposed method
  - A comprehensive Distributed Co-clustering (DisCo) solution
  - Using Hadoop
  - DisCo is a scalable framework under which various co-clustering algorithms can be implemented

4 Map-Reduce framework
- Employs a distributed storage cluster
- Block-addressable storage
- A centralized metadata server
- A convenient data access and storage API for Map-Reduce tasks

5 Co-clustering
- Algorithm cluster shapes
  - Checkerboard partitions
  - Single bi-cluster
  - Exclusive row and column partitions
  - Overlapping partitions
- Optimization criteria
  - Code length

6
- Identify the source and obtain the data.
- Transform the raw data into the appropriate format for data analysis.
- Present the results visually, or turn them into the input for other applications.

7 Data pre-processing
- Processing a 350 GB raw network event log
  - Needs over 5 hours to extract source/destination IP pairs
  - Much better performance is achieved on a few commodity nodes running Hadoop
- Setting up Hadoop required minimal effort

8
- Specifically for co-clustering, there are two main pre-processing tasks:
  - Building the graph from raw data
  - Pre-computing the transpose
- During co-clustering optimization, we need to iterate over both rows and columns.
- We therefore need to pre-compute the adjacency lists for both the original graph and its transpose.
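The two pre-processing tasks above can be sketched in a few lines of plain Python. This is a minimal, single-process illustration of the map/reduce pattern, not the presenters' Hadoop job: the function names (`map_edge`, `reduce_adjacency`, `build_adjacency`), the `"row"`/`"col"` key tags, and the sample IP pairs are all assumptions made for the example.

```python
from collections import defaultdict

def map_edge(record):
    """Map step (hypothetical): emit each edge twice, once per orientation."""
    src, dst = record
    yield ("row", src), dst   # original graph: source -> destinations
    yield ("col", dst), src   # transpose: destination -> sources

def reduce_adjacency(pairs):
    """Reduce step (hypothetical): group values by key into sorted,
    de-duplicated adjacency lists."""
    grouped = defaultdict(set)
    for key, value in pairs:
        grouped[key].add(value)
    return {key: sorted(values) for key, values in grouped.items()}

def build_adjacency(records):
    """Run map then reduce in-process and split the result by orientation."""
    emitted = (kv for record in records for kv in map_edge(record))
    lists = reduce_adjacency(emitted)
    rows = {node: adj for (tag, node), adj in lists.items() if tag == "row"}
    cols = {node: adj for (tag, node), adj in lists.items() if tag == "col"}
    return rows, cols

# Made-up source/destination pairs, including one duplicate event.
edges = [
    ("10.0.0.1", "10.0.0.9"),
    ("10.0.0.1", "10.0.0.7"),
    ("10.0.0.2", "10.0.0.9"),
    ("10.0.0.1", "10.0.0.9"),
]
rows, cols = build_adjacency(edges)
print(rows)
print(cols)
```

Emitting both orientations in a single pass means the raw log is read only once, and later row and column iterations can each scan their own adjacency lists.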

9 Definitions and overview
- Matrices are denoted by boldface capital letters.
- Vectors are denoted by boldface lowercase letters.
- a_ij: the (i, j)-th element of matrix A.
- Co-clustering algorithms employ a checkerboard partition of the original adjacency matrix into a grid of sub-matrices.
- For an m × n matrix, a co-clustering is a pair of row and column labeling vectors.
- r(i): the row-group label of the i-th row of the matrix.
- G: the k × ℓ group matrix.

10 g_pq gives the sufficient statistics for the (p, q) sub-matrix.
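As a concrete illustration of the group matrix, the sketch below computes G for a small binary matrix. It assumes that the sufficient statistic g_pq is simply the number of non-zero entries in the (p, q) sub-matrix, which is one natural choice for binary data; the slides do not say which statistic is used. The function name `group_matrix`, the 0-based labels, and the sample data are hypothetical.

```python
def group_matrix(rows, r, c, k, l):
    """Count non-zeros per (row group, column group) block.

    rows: dict mapping row index -> list of column indices holding a 1
    r, c: row and column labeling vectors (0-based group labels)
    k, l: number of row groups and column groups
    """
    G = [[0] * l for _ in range(k)]
    for i, cols in rows.items():
        for j in cols:
            G[r[i]][c[j]] += 1   # assumed statistic: non-zero count
    return G

# A 4 x 4 binary matrix stored as adjacency lists (made-up data).
rows = {0: [0, 1], 1: [0, 1], 2: [2, 3], 3: [3]}
r = [0, 0, 1, 1]   # rows 0-1 in row group 0, rows 2-3 in row group 1
c = [0, 0, 1, 1]   # columns 0-1 in column group 0, columns 2-3 in group 1
G = group_matrix(rows, r, c, k=2, l=2)
print(G)
```

With this labeling all non-zeros fall into the two diagonal blocks, which is the checkerboard structure a good co-clustering tries to expose.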

11 Map function

12 Reduce function

13 Global sync
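The transcript keeps only the titles of these three slides, so the following is a hedged sketch of how one row iteration could be split into a map function, a reduce function, and a global sync, not a reconstruction of the presenters' code. Every name here (`cc_map`, `cc_reduce`, `global_sync`, `row_iteration`) is hypothetical, and the cost function is an assumed smoothed Bernoulli code length for a binary matrix; the slides mention code length as an optimization criterion but give no formula.

```python
import math
from collections import defaultdict

def row_stats(cols, c, l):
    """Count the row's non-zeros that fall in each column group."""
    s = [0] * l
    for j in cols:
        s[c[j]] += 1
    return s

def cost(s, p, G, row_sizes, col_sizes):
    """Assumed cost: bits needed to encode the row using the block
    densities of row group p."""
    total = 0.0
    for q, ones in enumerate(s):
        cells = row_sizes[p] * col_sizes[q]
        # Smoothing keeps both logarithms finite for empty or full blocks.
        rho = (G[p][q] + 0.5) / (cells + 1.0)
        zeros = col_sizes[q] - ones
        total -= ones * math.log2(rho) + zeros * math.log2(1.0 - rho)
    return total

def cc_map(i, cols, c, G, row_sizes, col_sizes):
    """Map: pick the cheapest row group for row i, emit (group, stats)."""
    s = row_stats(cols, c, len(col_sizes))
    best = min(range(len(G)),
               key=lambda p: cost(s, p, G, row_sizes, col_sizes))
    return best, (s, [i])

def cc_reduce(p, values):
    """Reduce: sum the statistics and merge the member rows of group p."""
    total, members = None, []
    for s, ids in values:
        total = list(s) if total is None else [a + b for a, b in zip(total, s)]
        members.extend(ids)
    return p, (total, sorted(members))

def global_sync(reduced, r, k, l):
    """Global sync: rebuild G and the row labels from the reducer outputs."""
    G = [[0] * l for _ in range(k)]
    new_r = list(r)
    for p, (total, members) in reduced:
        G[p] = list(total)
        for i in members:
            new_r[i] = p
    return G, new_r

def row_iteration(rows, r, c, G, k, l):
    """Run map, shuffle, reduce, and sync in a single process."""
    row_sizes = [r.count(p) for p in range(k)]
    col_sizes = [c.count(q) for q in range(l)]
    shuffled = defaultdict(list)
    for i, cols in rows.items():
        p, value = cc_map(i, cols, c, G, row_sizes, col_sizes)
        shuffled[p].append(value)
    reduced = [cc_reduce(p, values) for p, values in shuffled.items()]
    return global_sync(reduced, r, k, l)

# Made-up 4 x 4 binary matrix with a deliberately poor initial row labeling.
rows = {0: [0, 1], 1: [0, 1], 2: [2, 3], 3: [3]}
c = [0, 0, 1, 1]
r0 = [0, 1, 0, 1]
G0 = [[2, 2], [2, 1]]   # non-zero counts per block under r0 and c
G1, r1 = row_iteration(rows, r0, c, G0, k=2, l=2)
print(r1, G1)
```

In this sketch each mapper needs only the current G, the column labels, and the group sizes, all of which are small compared with the matrix itself. A column iteration would be the same code run over the transposed adjacency lists with the roles of r and c swapped.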

14 Setup
- 39 nodes
- Two dual-core processors
- 8 GB RAM
- Linux RHEL4
- 4 Gbps Ethernet
- SATA disks, 65 MB/sec or roughly 500 Mbps
- The total capacity of the HDFS cluster was just 2.4 terabytes
- HDFS block size was set to 64 MB (the default value)
- Java: Sun JDK version 1.6.0_03

15 The pre-processing step on the ISS data
- Default values
  - 39 nodes
  - 6 concurrent maps per node
  - 5 reduce tasks
  - 256 MB input split size
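To get a feel for what these defaults imply, the arithmetic below estimates how many map tasks and scheduling waves a job would need. It rests on assumptions that are not stated on the slide: that the input is the 350 GB log mentioned earlier, that sizes are binary (1 GB = 1024 MB), and that one map task is created per input split. The helper name `map_waves` is hypothetical.

```python
import math

def map_waves(input_bytes, split_bytes, nodes, maps_per_node):
    """Return (number of splits, concurrent map slots, scheduling waves)."""
    splits = math.ceil(input_bytes / split_bytes)   # one map task per split
    slots = nodes * maps_per_node                   # maps running at once
    return splits, slots, math.ceil(splits / slots)

MB = 1024 ** 2
GB = 1024 ** 3

# Assumed input size: the 350 GB raw log; other values are the slide defaults.
splits, slots, waves = map_waves(350 * GB, 256 * MB, nodes=39, maps_per_node=6)
print(splits, slots, waves)
```

Under these assumptions the job has 1400 map tasks competing for 234 slots, so the map phase completes in about six waves.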

16

17
- Using relatively low-cost components:
  - I/O rates exceed those of high-performance storage systems.
  - Performance scales almost linearly with the number of machines/disks.

