Published by August Curtis. Modified over 9 years ago.
1
DisCo: Distributed Co-clustering with Map-Reduce
S. Papadimitriou, J. Sun
IBM T.J. Watson Research Center
Speakers: 0356169 吳宏君, 0350741 陳威遠, 0356042 洪浩哲
2
Outline
– Introduction
– Distributed Mining Process
– Co-clustering
– Experiments
– Related Work
– Conclusions
– Discussion
4
Introduction
– Background
– Goal
– Map-Reduce
5
Background
Huge datasets are becoming prevalent
– Real-world applications produce huge volumes of messy data (terabytes or more)
– Pre-processing the raw data is important
Map-Reduce as a tool
– A simple but powerful execution engine
– Independent of data models and storage schemes
6
Goal
Focus on co-clustering (bi-clustering) of pairwise relationships extracted from raw data
– Co-clustering searches for sub-matrices of rows and columns that are inter-related
Propose a comprehensive Distributed Co-clustering (DisCo) solution, from raw data to the end clusters
– It involves data gathering, pre-processing, analysis, and presentation
– It uses Map-Reduce (Hadoop) both as the programming model and as the implementation testbed
7
Map-Reduce
Distributed, scalable, fault-tolerant tools for data storage, management, and processing
– A distributed execution engine for select-project via sequential scan, followed by hashed partitioning and sort-merge group-by
– Suited to data already stored on a distributed file system
– Can transparently use any number of machines
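The three stages named above (sequential scan with select-project, hashed partitioning, sort-merge group-by) can be mimicked in a few lines of single-process Python. This is only a sketch for intuition, not how Hadoop is implemented; `run_map_reduce`, `count_mapper`, `sum_reducer`, and the sample records are hypothetical names and data introduced for illustration.

```python
from itertools import groupby
from operator import itemgetter


def run_map_reduce(records, mapper, reducer, num_partitions=2):
    """Run one Map-Reduce job in memory (illustrative only)."""
    partitions = [[] for _ in range(num_partitions)]
    # Map phase: one sequential scan; each record emits zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Hashed partitioning: pairs with the same key land in the same partition.
            partitions[hash(key) % num_partitions].append((key, value))

    output = {}
    for part in partitions:
        # Sort by key, then group adjacent equal keys (sort-merge group-by).
        part.sort(key=itemgetter(0))
        for key, group in groupby(part, key=itemgetter(0)):
            output[key] = reducer(key, [value for _, value in group])
    return output


def count_mapper(record):
    # Select-project: keep only the source id of each (src, dst) record.
    src, _dst = record
    yield src, 1


def sum_reducer(_key, values):
    return sum(values)


# Hypothetical (src, dst) records.
records = [(1, 2), (1, 4), (1, 5), (2, 1), (2, 3)]
out_degree = run_map_reduce(records, count_mapper, sum_reducer)
print(out_degree)
```

In a real deployment each partition would be processed by a separate reduce task on a possibly different machine; here the partitions are just lists.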
8
Map-Reduce
11
Distributed Mining Process
12
Distributed Mining Process
Data pre-processing
– Building the graph from raw data
– Pre-computing the transpose
Extract SrcIP and DstIP from each record and build the adjacency matrix
13
Distributed Mining Process
Data pre-processing
– Building the graph from raw data
– Pre-computing the transpose
During co-clustering optimization we need to iterate over both rows and columns, so we pre-compute the adjacency lists of both the original graph and its transpose
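A minimal sketch of the two pre-processing steps, assuming the raw data has already been reduced to (source, destination) pairs. The function name `build_adjacency` and the integer ids standing in for SrcIP and DstIP values are hypothetical; in DisCo this work is done with Map-Reduce jobs, not in memory.

```python
from collections import defaultdict


def build_adjacency(pairs):
    """Return adjacency lists of the graph and of its transpose."""
    rows = defaultdict(set)  # source -> set of destinations
    cols = defaultdict(set)  # destination -> set of sources (the transpose)
    for src, dst in pairs:
        rows[src].add(dst)   # sets drop repeated (src, dst) pairs
        cols[dst].add(src)

    def as_sorted_lists(d):
        return {key: sorted(values) for key, values in sorted(d.items())}

    return as_sorted_lists(rows), as_sorted_lists(cols)


# Hypothetical pairs; (1, 2) appears twice, as repeated traffic would.
pairs = [(1, 2), (1, 4), (1, 5), (2, 1), (2, 3), (3, 2),
         (3, 4), (3, 5), (4, 1), (4, 3), (1, 2)]
adjacency, transpose = build_adjacency(pairs)
print(adjacency)
print(transpose)
```

Keeping both structures means a row iteration can read `adjacency` and a column iteration can read `transpose` without recomputing either one.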
15
Co-clustering
Definitions and overview
– Co-clustering clusters the rows and columns of a matrix simultaneously
– Input format: a matrix with m rows and n columns
– The co-clustering algorithm employs a checkerboard partition: the original adjacency matrix becomes a grid of sub-matrices
16
Co-clustering
17
Co-clustering
Goal
– Find good row and column group assignment vectors (r and c) such that the error function is minimized
18
Co-clustering
19
Co-clustering
Co-clustering(A, k, l) with k = 2, l = 2
A =
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
20
Co-clustering
Column groups: c(1)=1, c(2)=1, c(3)=1, c(4)=2, c(5)=2
r(1)=1: 0 1 0 1 1
r(2)=1: 1 0 1 0 0
r(3)=1: 0 1 0 1 1
r(4)=2: 1 0 1 0 0
21
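Co-clustering
Column groups: c(1)=1, c(2)=1, c(3)=1, c(4)=2, c(5)=2
r(1)=1: 0 1 0 1 1
r(2)=1: 1 0 1 0 0
r(3)=1: 0 1 0 1 1
r(4)=2: 1 0 1 0 0
Row 2 is reassigned: r(2)=2. Rows reordered by group:
0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0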
22
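Co-clustering
Column groups: c(1)=1, c(2)=1, c(3)=1, c(4)=2, c(5)=2
Rows reordered by group:
0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0
Column 2 is reassigned: c(2)=2. Columns reordered by group:
0 0 1 1 1
0 0 1 1 1
1 1 0 0 0
1 1 0 0 0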
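The reordered matrices in this example can be reproduced by sorting row and column indices by their group labels. The sketch below only does that permutation; it does not choose the labels. The helper name `reorder` is hypothetical, and the label vectors are the ones shown on the slides.

```python
A = [
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
]


def reorder(matrix, r, c):
    """Permute rows and columns so that members of a group are adjacent."""
    # sorted() is stable, so rows/columns keep their order within a group.
    row_order = sorted(range(len(r)), key=lambda i: r[i])
    col_order = sorted(range(len(c)), key=lambda j: c[j])
    return [[matrix[i][j] for j in col_order] for i in row_order]


# After the row step: r(2) moves to group 2, columns unchanged.
after_rows = reorder(A, r=[1, 2, 1, 2], c=[1, 1, 1, 2, 2])
# After the column step: c(2) moves to group 2.
after_cols = reorder(A, r=[1, 2, 1, 2], c=[1, 2, 1, 2, 2])

for row in after_cols:
    print("".join(map(str, row)))
```

With the final labels every one of the 2 × 2 blocks is either all zeros or all ones, which is the checkerboard structure the algorithm is looking for.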
23
Co-clustering
Co-clustering with Map-Reduce
– One iteration over rows runs as one Map-Reduce job
24
Co-clustering
Co-clustering with Map-Reduce
25
Co-clustering
Co-clustering with Map-Reduce
Matrix mapped to adjacency lists:
0 1 0 1 1    1 -> 2,4,5
1 0 1 0 0    2 -> 1,3
0 1 0 1 1    3 -> 2,4,5
1 0 1 0 0    4 -> 1,3
r, c, G: random initialization based on the parameters k, l
26
Co-clustering
Co-clustering with Map-Reduce
Matrix mapped to adjacency lists:
0 1 0 1 1    1 -> 2,4,5
1 0 1 0 0    2 -> 1,3
0 1 0 1 1    3 -> 2,4,5
1 0 1 0 0    4 -> 1,3
k = 2, l = 2
r = {1,1,1,2}
c = {1,1,1,2,2}
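Given r and c, the group statistics G follow from one pass over the adjacency lists. The sketch below assumes G is the k × l matrix that counts the non-zero entries in each block; the slides do not spell this out, so treat that reading, and the helper name `group_statistics`, as assumptions.

```python
def group_statistics(adjacency, r, c, k, l):
    """Count the non-zeros falling in each (row group, column group) block."""
    G = [[0] * l for _ in range(k)]
    for i, cols in adjacency.items():
        for j in cols:
            # Rows, columns, and group labels are 1-based, as on the slides.
            G[r[i - 1] - 1][c[j - 1] - 1] += 1
    return G


adjacency = {1: [2, 4, 5], 2: [1, 3], 3: [2, 4, 5], 4: [1, 3]}
r = [1, 1, 1, 2]
c = [1, 1, 1, 2, 2]
G = group_statistics(adjacency, r, c, k=2, l=2)
print(G)
```

Row group 1 holds rows 1 to 3, so its two blocks collect 4 non-zeros each, while row group 2 (row 4 alone) has both of its non-zeros in column group 1.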
27
Co-clustering
Co-clustering with Map-Reduce
Row iteration (column assignment fixed)
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
Row 1: 1 -> 2,4,5
(key, value) = (1, {2,4,5})
28
Co-clustering
Co-clustering with Map-Reduce
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
Row 2: 2 -> 1,3
(key, value) = (2, {1,3})
29
Co-clustering
Co-clustering with Map-Reduce
30
Co-clustering
Co-clustering with Map-Reduce
p = 1
Intermediate key, intermediate value
31
Co-clustering
Co-clustering with Map-Reduce
(1, {(2,4), (1,3)})
(2, {(4,0), (2,4)})
Rows reordered by group:
0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0
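The row iteration walked through above can be sketched as a map function that relabels one row and a reduce function that merges the per-row statistics of each group. The slides do not state the error function, so `cost` below is a hypothetical stand-in (squared difference between a row's densities and a group's block densities), and all function names are illustrative. The shuffle step is simulated with a dictionary.

```python
from collections import defaultdict

adjacency = {1: [2, 4, 5], 2: [1, 3], 3: [2, 4, 5], 4: [1, 3]}
k, l = 2, 2
r = [1, 1, 1, 2]          # current row labels
c = [1, 1, 1, 2, 2]       # column labels, fixed during the row iteration
G = [[4, 4], [2, 0]]      # non-zeros per block under the labels above

# Group sizes; this sketch assumes no group is empty.
row_sizes = [r.count(p + 1) for p in range(k)]
col_sizes = [c.count(q + 1) for q in range(l)]


def collect_stats(cols):
    """Non-zeros of one row, counted per column group."""
    g = [0] * l
    for j in cols:
        g[c[j - 1] - 1] += 1
    return tuple(g)


def cost(g, p):
    """Hypothetical cost of placing a row with stats g into row group p."""
    total = 0.0
    for q in range(l):
        row_density = g[q] / col_sizes[q]
        block_density = G[p][q] / (row_sizes[p] * col_sizes[q])
        total += (row_density - block_density) ** 2
    return total


def row_map(i, cols):
    """Emit (new row group, (row stats, row id)) for row i."""
    g = collect_stats(cols)
    best = min(range(k), key=lambda p: cost(g, p))
    return best + 1, (g, i)


def row_reduce(p, values):
    """Sum the stats of all rows assigned to group p and list its members."""
    summed = tuple(sum(g[q] for g, _ in values) for q in range(l))
    members = tuple(sorted(i for _, i in values))
    return p, (summed, members)


shuffled = defaultdict(list)
for i, cols in adjacency.items():
    key, value = row_map(i, cols)
    shuffled[key].append(value)

result = dict(row_reduce(p, vals) for p, vals in sorted(shuffled.items()))
print(result)
```

Under this assumed cost, rows 1 and 3 stay in group 1 and rows 2 and 4 end up in group 2, which matches the two reduce outputs listed on the slide. The column iteration would run the same job on the transpose with the roles of r and c swapped.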
32
Co-clustering
33
Co-clustering
Performance tuning
– Some parameters have to do with thread pool sizes
– Another parameter is the number of map tasks
34
Outline
– Introduction
– Distributed Mining Process
– Co-clustering
– Experiments
  – Setup
  – Scalability and performance
– Related Work
– Conclusions
– Discussion
35
Experiments
Setup
– 39 nodes in the cluster
– Machines are located in 4 blade servers
– Hadoop Distributed File System (HDFS) capacity: 2.4 TB
– Sun JDK 1.6.0_03
– Per-node hardware and software:
  CPU: 2 × Intel Xeon 2.66 GHz (two dual-core)
  Memory: 8 GB
  OS: Linux (Red Hat Enterprise Linux)
– Datasets: (table not preserved)
36
Experiments (cont'd)
Scalability and performance
– Performance: the effect of parameters
  1) maximum number of concurrent map tasks per node
  2) number of reduce tasks
  3) minimum input split size
– Scalability: wall-clock time vs. number of nodes
37
Experiments (cont'd)
Preprocessing ISS Data
Optimal Map-Reduce parameter values (Figure 8)
– 6 concurrent map tasks per node
– 5 reduce tasks
– 256 MB input split size
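A rough back-of-the-envelope sketch of how the input split size and the number of concurrent map tasks interact. It assumes one map task per input split and that all 39 nodes run map tasks; the 100 GB input size is hypothetical and is not a figure from the experiments, and `map_task_plan` is an illustrative helper, not a Hadoop API.

```python
import math


def map_task_plan(input_mb, split_mb, nodes, maps_per_node):
    """Estimate the number of map tasks and the waves needed to run them."""
    tasks = math.ceil(input_mb / split_mb)   # one map task per split
    slots = nodes * maps_per_node            # map tasks that can run at once
    waves = math.ceil(tasks / slots)
    return tasks, waves


INPUT_MB = 100 * 1024  # hypothetical 100 GB input

tasks_256, waves_256 = map_task_plan(INPUT_MB, 256, nodes=39, maps_per_node=6)
tasks_64, waves_64 = map_task_plan(INPUT_MB, 64, nodes=39, maps_per_node=6)
print(tasks_256, waves_256)
print(tasks_64, waves_64)
```

Larger splits mean fewer tasks and less per-task framework overhead, while smaller splits give finer-grained load balancing; the estimate ignores both effects and only counts tasks.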
38
Experiments (cont'd)
Co-clustering TREC Data
– As the job size decreases, framework overheads become relatively larger
– Two observations
  1) At 20±2 sec/iteration, the cluster is better than a single machine with 48 GB RAM
  2) As the dataset size increases, the implementation achieves linear scale-up
39
Experiments (cont'd)
Behavior of the co-clustering iteration
– The effects of the number of concurrent maps, the number of reduce tasks, and the input split size are almost identical to those in Figure 8
40
Outline
– Introduction
– Distributed Mining Process
– Co-clustering
– Experiments
– Related Work
  – Map-Reduce framework
  – Co-clustering
– Conclusions
– Discussion
41
Related Work
Map-Reduce framework
– Simple but powerful
– Uses a distributed file system (GFS, HDFS, …)
– Block-addressable storage and a centralized metadata server
42
Related Work (cont'd)
Co-clustering
– Cluster shapes
– Checkerboard partitions
– Properties of input data
– Optimization objective
44
Conclusions
Designing a holistic approach to data mining
– Distributed infrastructure
– Map-Reduce
– Co-clustering
Distributed Co-clustering framework
– Uses relatively low-cost components
– Performance scales almost linearly as machines and disks are added
– Results are demonstrated on real-world data sets
46
Discussion
– In a distributed file system, how should the situation be handled when some tasks fail?
– As hardware develops, will performance increase linearly?
– Lack of experimental records.
47
Discussion
– Increasing the input split size to several times the HDFS block size makes it harder to place map tasks on local data copies. Why?
48
Q & A
49
Thanks for your attention!