
1 DisCo: Distributed Co-clustering with Map-Reduce S. Papadimitriou, J. Sun IBM T.J. Watson Research Center Speakers: 0356169 吳宏君, 0350741 陳威遠, 0356042 洪浩哲

2 Outline  Introduction  Distributed Mining Process  Co-clustering  Experiments  Related Work  Conclusions  Discussion


4 Introduction Background Goal Map-Reduce

5 Background Huge datasets are becoming prevalent – Real-world applications produce huge volumes of messy data (terabytes, or more) – Pre-processing the raw data is therefore important The Map-Reduce tool – A simple but powerful execution engine – Unconcerned about data models and storage schemes

6 Goal Focus on co-clustering (bi-clustering) of pairwise relationships extracted from raw data – Co-clustering searches for groups of rows and columns that are inter-related Propose a comprehensive Distributed Co-clustering (DisCo) solution, from raw data to the end clusters – This involves data gathering, pre-processing, analysis, and presentation – Uses Map-Reduce (Hadoop) both as the programming model and as the implementation testbed

7 Map-Reduce A distributed, scalable, fault-tolerant data storage, management, and processing tool – A distributed execution engine: select-project via sequential scan, followed by hashed partitioning and sort-merge group-by – Suited for data already stored on a distributed file system – Can transparently use any number of machines
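As a rough illustration of the programming model (not code from the paper), the whole pipeline reduces to two user-supplied functions, map and reduce, with the framework handling partitioning and group-by in between. A minimal single-machine simulation in Python:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy, single-machine simulation of the Map-Reduce data flow:
    map over records, shuffle (group by key), then reduce each group."""
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))          # map phase
    intermediate.sort(key=itemgetter(0))             # shuffle: sort by key
    return [reduce_fn(key, [v for _, v in group])    # reduce phase
            for key, group in groupby(intermediate, key=itemgetter(0))]

# Example: collect destination IPs per source IP from (src, dst) records
records = [("10.0.0.1", "10.0.0.2"), ("10.0.0.1", "10.0.0.3"),
           ("10.0.0.4", "10.0.0.2")]
out = run_mapreduce(records,
                    map_fn=lambda r: [(r[0], r[1])],
                    reduce_fn=lambda k, vs: (k, sorted(set(vs))))
print(out)  # [('10.0.0.1', ['10.0.0.2', '10.0.0.3']), ('10.0.0.4', ['10.0.0.2'])]
```

On a real cluster, Hadoop runs the map function on distributed input splits and shuffles the intermediate pairs to the reducers; the simulation above only mimics that data flow in memory.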

8 Map-Reduce [figure]

9 [figure]

10 Outline  Introduction  Distributed Mining Process  Co-clustering  Experiments  Related Work  Conclusions  Discussion

11 Distributed Mining Process

12 Distributed Mining Process Data pre-processing – Building the graph from raw data – Pre-computing the transpose Example: extract the SrcIP and DstIP fields from network logs and build the adjacency matrix

13 Distributed Mining Process Data pre-processing – Building the graph from raw data – Pre-computing the transpose During co-clustering optimization we need to iterate over both rows and columns, so we pre-compute the adjacency lists for both the original graph and its transpose
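A hedged sketch of this pre-processing step, reusing the toy run_mapreduce defined earlier (the real DisCo jobs run on Hadoop over raw log files; the record layout below is an assumption for illustration):

```python
# Hypothetical raw log lines: "timestamp src_ip dst_ip bytes"
raw = ["1199145600 10.0.0.1 10.0.0.2 512",
       "1199145601 10.0.0.1 10.0.0.3 128",
       "1199145602 10.0.0.4 10.0.0.2 256"]

def parse(line):
    fields = line.split()
    return fields[1], fields[2]  # (src_ip, dst_ip)

edges = [parse(line) for line in raw]

# Adjacency lists of the graph: one Map-Reduce pass keyed on the source
adj = run_mapreduce(edges,
                    map_fn=lambda e: [(e[0], e[1])],
                    reduce_fn=lambda k, vs: (k, sorted(set(vs))))

# Adjacency lists of the transpose: the same pass with key/value swapped
adj_t = run_mapreduce(edges,
                      map_fn=lambda e: [(e[1], e[0])],
                      reduce_fn=lambda k, vs: (k, sorted(set(vs))))
```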

14 Outline  Introduction  Distributed Mining Process  Co-clustering  Experiments  Related Work  Conclusions  Discussion

15 Co-clustering Definitions and overview – Co-clustering allows simultaneous clustering of the rows and columns of a matrix – Input format: a matrix with m rows and n columns – The algorithm employs a checkerboard partition: the original adjacency matrix is rearranged into a k × l grid of sub-matrices

16 Co-clustering [figure]

17 Co-clustering Goal – Find row and column group assignment vectors r and c such that an error function over the resulting sub-matrices is minimized
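The deck never writes the error function out; one common concrete choice in checkerboard co-clustering (an assumption here, not necessarily the exact cost DisCo optimizes) is the squared deviation of each entry from its block mean:

```latex
\[
  \min_{r,\,c}\; \sum_{i=1}^{m}\sum_{j=1}^{n}
    \bigl(A_{ij} - \mu_{r(i),\,c(j)}\bigr)^{2},
  \qquad
  \mu_{p,q} =
  \frac{\sum_{i:\,r(i)=p}\,\sum_{j:\,c(j)=q} A_{ij}}
       {\lvert\{\,i : r(i)=p\,\}\rvert \cdot \lvert\{\,j : c(j)=q\,\}\rvert}
\]
```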

18 Co-clustering [figure]

19 Co-clustering Co-clustering(A, k, l) with k = 2, l = 2
A =
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0

20 Co-clustering Initial assignments: c(1)=1, c(2)=1, c(3)=1, c(4)=2, c(5)=2; r(1)=1, r(2)=1, r(3)=1, r(4)=2
A =
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0

21 Co-clustering Row step: reassigning r(2)=2 groups row 2 with row 4
Before:
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
After permuting rows by group (rows 1,3 then 2,4):
0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0

22 Co-clustering Column step: reassigning c(2)=2 groups column 2 with columns 4 and 5
Before (rows already permuted, c = {1,1,1,2,2}):
0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0
After permuting columns by group (columns 1,3 then 2,4,5):
0 0 1 1 1
0 0 1 1 1
1 1 0 0 0
1 1 0 0 0
The result is a perfect 2 × 2 checkerboard of homogeneous blocks
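A minimal sketch of this alternating optimization on the toy matrix, using the squared-error cost written out above (a generic checkerboard co-clustering step, not the paper's exact algorithm). Starting from the slides' initial assignments, it reproduces exactly the moves shown: r(2) changes group in the row step, c(2) in the column step, and the final cost is 0:

```python
import numpy as np

A = np.array([[0,1,0,1,1],
              [1,0,1,0,0],
              [0,1,0,1,1],
              [1,0,1,0,0]], dtype=float)
k, l = 2, 2
r = np.array([0, 0, 0, 1])      # row groups (0-based here)
c = np.array([0, 0, 0, 1, 1])   # column groups

def cost(A, r, c, k, l):
    """Squared deviation of every entry from its block mean."""
    total = 0.0
    for p in range(k):
        for q in range(l):
            block = A[np.ix_(r == p, c == q)]
            if block.size:
                total += ((block - block.mean()) ** 2).sum()
    return total

def reassign_rows(A, r, c, k, l):
    """Greedily move each row to the group that lowers the cost most."""
    for i in range(len(r)):
        best = min(range(k),
                   key=lambda g: cost(A, np.where(np.arange(len(r)) == i, g, r),
                                      c, k, l))
        r[i] = best
    return r

r = reassign_rows(A, r, c, k, l)     # row step: r becomes [0 1 0 1]
c = reassign_rows(A.T, c, r, l, k)   # column step, run on the transpose
print(r, c, cost(A, r, c, k, l))     # [0 1 0 1] [0 1 0 1 1] 0.0
```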

23 Co-clustering Co-clustering with Map-Reduce One iteration over the rows is expressed as a single Map-Reduce job

24 Co-clustering Co-clustering with Map-Reduce [figure]

25 Co-clustering Co-clustering with Map-Reduce
A =
0 1 0 1 1
1 0 1 0 0
0 1 0 1 1
1 0 1 0 0
Adjacency lists (map input):
1 -> 2,4,5
2 -> 1,3
3 -> 2,4,5
4 -> 1,3
r, c, G are randomly initialized based on the parameters k, l

26 Co-clustering Co-clustering with Map-Reduce
Adjacency lists:
1 -> 2,4,5
2 -> 1,3
3 -> 2,4,5
4 -> 1,3
Initialization: k = 2, l = 2, r = {1,1,1,2}, c = {1,1,1,2,2}
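To make the initialization concrete, a standalone Python sketch; reading the slide's "G" as the per-(row group, column group) nonzero counts is our assumption:

```python
import random
random.seed(0)  # reproducible sketch

m, n, k, l = 4, 5, 2, 2
adj = {1: [2, 4, 5], 2: [1, 3], 3: [2, 4, 5], 4: [1, 3]}

# Random initial group assignments (1-based, matching the slides)
r = {i: random.randint(1, k) for i in range(1, m + 1)}
c = {j: random.randint(1, l) for j in range(1, n + 1)}

# G: nonzero count per (row group, column group) block, our reading of
# the slide's "G" (the sufficient statistics the algorithm maintains)
G = [[0] * l for _ in range(k)]
for i, cols in adj.items():
    for j in cols:
        G[r[i] - 1][c[j] - 1] += 1

# With the slides' values r = {1,1,1,2}, c = {1,1,1,2,2},
# this yields G = [[4, 4], [2, 0]].
```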

27 Co-clustering Co-clustering with Map-Reduce Fix the column assignments and iterate over rows: adjacency list 1 -> 2,4,5 becomes the input pair (key, value) = (1, {2,4,5})

28 Co-clustering Co-clustering with Map-Reduce Likewise, adjacency list 2 -> 1,3 becomes (key, value) = (2, {1,3})

29 Co-clustering Co-clustering with Map-Reduce [figure]

30 Co-clustering Co-clustering with Map-Reduce For row group p = 1, the map phase emits intermediate (key, value) pairs

31 Co-clustering Co-clustering with Map-Reduce Intermediate pairs: (1, {(2,4),(1,3)}) and (2, {(4,0),(2,4)})
A (rows permuted by group):
0 1 0 1 1
0 1 0 1 1
1 0 1 0 0
1 0 1 0 0
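A loose Python reconstruction of the row iteration as a Map-Reduce job, reusing the toy run_mapreduce from earlier. The key/value formats and the group-selection criterion below are illustrative assumptions, not DisCo's exact ones:

```python
c = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2}   # fixed column assignments
k, l = 2, 2
adj = [(1, [2, 4, 5]), (2, [1, 3]), (3, [2, 4, 5]), (4, [1, 3])]

def row_stats(cols):
    """Per-column-group nonzero counts for one row."""
    stats = [0] * l
    for j in cols:
        stats[c[j] - 1] += 1
    return tuple(stats)

def map_row(record):
    row, cols = record
    s = row_stats(cols)
    # Choose the best row group for this row. A stand-in criterion
    # (distance to an assumed current group signature) replaces the
    # paper's actual cost computation.
    signatures = {1: (1, 2), 2: (2, 0)}
    best = min(signatures,
               key=lambda g: sum((a - b) ** 2 for a, b in zip(s, signatures[g])))
    return [(best, s)]

def reduce_group(group, stats_list):
    # Sum per-column-group counts to rebuild that row group's slice of G
    return (group, tuple(map(sum, zip(*stats_list))))

print(run_mapreduce(adj, map_row, reduce_group))
# [(1, (2, 4)), (2, (4, 0))] -- rows {1,3} land in group 1, rows {2,4} in group 2
```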

32 Co-clustering [figure]

33 Co-clustering Performance tuning – One set of parameters concerns thread pool sizes – The key parameters are the number of concurrent map tasks per node, the number of reduce tasks, and the input split size

34 Outline  Introduction  Distributed Mining Process  Co-clustering  Experiments – Setup – Scalability and performance  Related Work  Conclusions  Discussion

35 Experiments  Setup
39 nodes in the cluster, housed in 4 blade servers
Hadoop Distributed File System (HDFS) capacity: 2.4TB
Sun JDK 1.6.0_03
Datasets: ISS network data and TREC (described with the individual experiments)
Per-node hardware:
CPU: 2 × Intel Xeon 2.66GHz (two dual-core)
Memory: 8GB
OS: Red Hat Enterprise Linux

36 Experiments (cont’d)  Scalability and performance Performance: the effect of three parameters 1) maximum number of concurrent map tasks per node 2) number of reduce tasks 3) minimum input split size Scalability: wall-clock time vs. number of nodes

37 Experiments (cont’d)  Pre-processing the ISS data Optimal Map-Reduce parameter values (Figure 8): 6 concurrent map tasks per node, 5 reduce tasks, 256MB input split size
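For Hadoop of that era (0.x/1.x), these three knobs correspond to the classic mapred.* configuration properties. A hedged sketch of submitting a job with them set on the command line (the job jar name is hypothetical, and the property names should be verified against the exact Hadoop version):

```python
# Build an old-style Hadoop job invocation with the three tuned knobs.
tuned = {
    "mapred.tasktracker.map.tasks.maximum": "6",      # concurrent maps / node
    "mapred.reduce.tasks": "5",                       # number of reduce tasks
    "mapred.min.split.size": str(256 * 1024 * 1024),  # 256MB input splits
}

cmd = ["hadoop", "jar", "disco-job.jar"]  # hypothetical job jar
for key, value in tuned.items():
    cmd += ["-D", f"{key}={value}"]
print(" ".join(cmd))
# On a real cluster this command list would be handed to subprocess.run.
```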

38 Experiments (cont’d)  Co-clustering the TREC data As the job size decreases, framework overheads grow  Two observations 1) At 20±2 sec per iteration, the cluster outperforms a single machine with 48GB RAM 2) As the dataset size grows, the implementation achieves linear scale-up

39 Experiments (cont’d)  Behavior of the co-clustering iteration The effects of the number of concurrent maps, the number of reduce tasks, and the input split size are almost identical to those in Figure 8

40 Outline  Introduction  Distributed Mining Process  Co-clustering  Experiments  Related Work – Map-Reduce framework – Co-clustering  Conclusions  Discussion

41 Related Work  Map-Reduce framework Simple but powerful Uses a distributed file system (GFS, HDFS, …) Block-addressable storage and a centralized metadata server

42 Related Work (cont’d)  Co-clustering Approaches differ in cluster shapes (e.g., checkerboard partitions), the properties of the input data, and the optimization objective

43 Outline  Introduction  Distributed Mining Process  Co-clustering  Experiments  Related Work  Conclusions  Discussion

44 Conclusions  Designed a holistic approach to data mining: distributed infrastructure, Map-Reduce, co-clustering  A Distributed Co-clustering (DisCo) framework Built from relatively low-cost components Performance scales almost linearly as machines/disks are added Results demonstrated on real-world data sets

45 Outline  Introduction  Distributed Mining Process  Co-clustering  Experiments  Related Work  Conclusions  Discussion

46 Discussion  In a distributed file system, how are failing tasks handled?  As hardware develops, will performance keep increasing linearly?  The paper lacks a detailed experimental record.

47 Discussion Why does increasing the input split size to several times the HDFS block size make it harder to place map tasks on local copies of the data?

48 Q & A

49 Thanks for your attention!

