A Pipelined Execution of Tiled Nested Loops on SMPs with Computation and Communication Overlapping. Maria Athanasaki, Aristidis Sotiropoulos, Georgios Tsoukalas, Nectarios Koziris. National Technical University of Athens, Dept. of Electrical and Computer Engineering, Computing Systems Laboratory.

Overview: Advanced Architectures; Tiling for parallelization; Non-overlapping vs. Overlapping scheme; Vertical vs. hyperplane grouping; Application on clusters of SMP nodes.

TCP/IP over FastEthernet: use of the popular Socket Interface. Create socket descriptor sd, then read/write from/to descriptor sd (write → send, read → receive).

Example: Send, write(sd, buffer, length);
1) system call (CPU goes from user to kernel mode)
2) CPU copies data from user to kernel space
3) CPU adds protocol headers (TCP, IP, ETH)
4) CPU programs DMA engine
5) DMA copies data to NIC
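The five steps above are all triggered by one write() call on the descriptor. The following is a minimal sketch of that call, not the authors' code: it assumes a local socketpair() as a stand-in for a connected TCP socket so that it runs without a network, and the helper name send_and_receive is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes `length` bytes to one end of a local socket
 * pair and reads them back from the other end. Returns the number of bytes
 * received, or -1 on error. The socket pair is an assumption standing in
 * for a connected TCP socket; keep messages small so write() cannot block. */
ssize_t send_and_receive(const char *buffer, size_t length, char *out)
{
    assert(buffer != NULL && out != NULL);

    int sd[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sd) < 0)
        return -1;

    /* One system call: the kernel copies the user buffer into kernel space. */
    ssize_t sent = write(sd[0], buffer, length);
    if (sent < 0) {
        close(sd[0]);
        close(sd[1]);
        return -1;
    }

    /* read() may return fewer bytes than requested, so loop until done. */
    size_t got = 0;
    while (got < (size_t)sent) {
        ssize_t n = read(sd[1], out + got, (size_t)sent - got);
        if (n <= 0)
            break;
        got += (size_t)n;
    }

    close(sd[0]);
    close(sd[1]);
    return (ssize_t)got;
}
```

Every byte passes through a CPU copy into kernel space here, which is the cost the later SCI slides aim to avoid.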

SCI: What about Scalable Coherent Interface? Point-to-point, DSM approach.

SCI DSM scheme. [Figure: an exported memory segment and an imported memory segment connected through SCI; diagram labels: write 100, 100, read 50.]

SCI Zero Copy Scheme: contiguous data in the process VM area are not contiguous in Physical Memory.

SCI Zero Copy Scheme: the process VM area is mapped to pinned-down Physical Memory. SCICreateSegment, SCIMapLocalSegment: mapping between Virtual and contiguous Physical Memory.

Data transfers. Programmed I/O mode: CPU handles data transferring; "lost" CPU cycles. DMA mode: CPU programs the NIC's buffers; not blocked during transfer; performs useful tasks.

SCI DMA approach. No copying by CPU: data already contiguous in PM; DMA engine copies data to network. No packetization: done in hardware. But, init only by kernel. We need VIA.

Nested For-Loops
    for (i_1 = l_1; i_1 <= u_1; i_1++)
        for (i_2 = l_2; i_2 <= u_2; i_2++)
            ...
                for (i_n = l_n; i_n <= u_n; i_n++) {
                    Loop Body
                }

Dependence Vectors [Figure: iteration space with axes i1 and i2.]
    for (i1 = 0; i1 <= 7; i1++)
        for (i2 = 0; i2 <= 7; i2++)
            A[i1][i2] = A[i1-1][i2] + A[i1][i2-1];
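Each point of this loop reads its neighbours at distance (1,0) and (0,1), so all points with the same i1 + i2 are independent of each other. The sketch below, under stated assumptions, checks this by running the loop in the original order and in wavefront order. The indices are shifted to 1..8 so that row 0 and column 0 can hold boundary values, and the boundary value 1 is an assumption, since the slide does not specify one. The function names are hypothetical.

```c
#include <assert.h>

#define N 8  /* interior points per dimension, as in the 8 x 8 example */

/* Row 0 and column 0 hold boundary values (assumed to be 1). */
static void init_boundary(long A[N + 1][N + 1])
{
    for (int i = 0; i <= N; i++) {
        A[i][0] = 1;
        A[0][i] = 1;
    }
}

/* Original lexicographic order of the loop nest. */
void fill_sequential(long A[N + 1][N + 1])
{
    init_boundary(A);
    for (int i1 = 1; i1 <= N; i1++)
        for (int i2 = 1; i2 <= N; i2++)
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1];
}

/* Same computation ordered by wavefronts i1 + i2 = t. Both operands of a
 * point on wavefront t lie on wavefront t - 1 or on the boundary, so the
 * points of one wavefront could be computed in parallel. */
void fill_wavefront(long A[N + 1][N + 1])
{
    init_boundary(A);
    for (int t = 2; t <= 2 * N; t++) {
        int lo = (t - N > 1) ? t - N : 1;
        int hi = (t - 1 < N) ? t - 1 : N;
        for (int i1 = lo; i1 <= hi; i1++) {
            int i2 = t - i1;
            assert(i2 >= 1 && i2 <= N);
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1];
        }
    }
}
```

Both orders produce the same array because every reordering that respects the two dependence vectors is legal.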

Tiling. [Figure: iteration space with axes i1 and i2, divided into tiles.]

Tiling. [Figure: tiled iteration space with axes i1 and i2; tiles assigned to Processor 0 and Processor 1.]
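As a rough illustration of tiling, the sketch below runs the two-dimensional example loop tile by tile with square tiles of side ts. It is a sequential sketch only: the square tile shape, the boundary value 1, the index shift to 1..8 and the name fill_tiled are assumptions, and no processor assignment is modelled.

```c
#include <assert.h>

#define N 8  /* interior points per dimension */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Executes A[i1][i2] = A[i1-1][i2] + A[i1][i2-1] in tiles of ts x ts points.
 * The two outer loops enumerate tiles, the two inner loops enumerate the
 * points inside one tile. Rectangular tiles are legal here because both
 * dependence vectors, (1,0) and (0,1), have non-negative components. */
void fill_tiled(long A[N + 1][N + 1], int ts)
{
    assert(ts > 0);

    /* Boundary row and column (value 1 is an assumption). */
    for (int i = 0; i <= N; i++) {
        A[i][0] = 1;
        A[0][i] = 1;
    }

    for (int t1 = 1; t1 <= N; t1 += ts)
        for (int t2 = 1; t2 <= N; t2 += ts)
            for (int i1 = t1; i1 <= min_int(t1 + ts - 1, N); i1++)
                for (int i2 = t2; i2 <= min_int(t2 + ts - 1, N); i2++)
                    A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1];
}
```

In a parallel run each tile would be an atomic unit of work, with data exchanged between processors only at tile boundaries.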

Non-Overlapping Scheme. [Figure: tiled iteration space with axes i1 and i2; Processor 0, Processor 1, Processor 2.]

Non-Overlapping vs. Overlapping Scheme. [Figure: processors P0, P1, P2, P3 shown for each scheme.]

Overlapping Scheme. [Figure: tiled iteration space with axes i1 and i2; Processor 0, Processor 1, Processor 2.]

Generalization to SMPs. [Figure: processors P0, P1, P2, P3.]

Generalization to SMPs. [Figure: SMP nodes SMP0, SMP1, SMP2, SMP3, each with CPU0 and CPU1.]

Generalization to SMPs. [Figure: CPU1 and CPU0.]

Generalization to SMPs. [Figure: SMP nodes SMP0, SMP1, SMP2, SMP3, each with CPU0 and CPU1.]

Vertical vs. Hyperplane grouping. [Figure: two diagrams, one per grouping, each showing SMP0, SMP1, SMP2, SMP3 with CPU0 and CPU1.]

Example. [Figure: Tile Space and Group Space, with SMP node0 and SMP node1.] Scheduling vector Π = (1,1).

Non-overlapping vs. Overlapping scheme. Non-overlapping scheme: 9 computation + 8 communication steps. Overlapping scheme: 12 steps, each lasting almost half as long as a non-overlapping execution step, at the cost of slightly more steps. [Figure: processors P0, P1, P2, P3 shown for each scheme.]
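The step counts on this slide can be reproduced with simple arithmetic. The sketch below is an assumption, not taken from the slide: it supposes a tile space of 4 processors with 6 tiles each, a linear schedule t = p + j for the non-overlapping scheme and t = 2p + j for the overlapping scheme, where p is the processor index and j the tile index along the mapping direction. These choices were picked because they yield the slide's 9 + 8 and 12; the function names are hypothetical.

```c
#include <assert.h>

/* Non-overlapping scheme, assumed schedule t = p + j:
 * the last tile (procs-1, tiles-1) runs at step (procs-1) + (tiles-1). */
int nonoverlap_comp_steps(int procs, int tiles)
{
    assert(procs > 0 && tiles > 0);
    return (procs - 1) + (tiles - 1) + 1;
}

/* One communication phase is assumed between consecutive computation steps. */
int nonoverlap_comm_steps(int procs, int tiles)
{
    return nonoverlap_comp_steps(procs, tiles) - 1;
}

/* Overlapping scheme, assumed schedule t = 2p + j: a neighbour starts two
 * steps later, leaving one step for the transfer to proceed while both
 * processors keep computing. */
int overlap_steps(int procs, int tiles)
{
    assert(procs > 0 && tiles > 0);
    return 2 * (procs - 1) + (tiles - 1) + 1;
}
```

With procs = 4 and tiles = 6 the functions return 9, 8 and 12. If a non-overlapping step costs computation plus communication while an overlapping step costs roughly the larger of the two, 12 shorter steps can finish before 9 + 8 sequential phases.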

Vertical vs. Hyperplane Grouping. Slower pipeline filling. Faster execution because of lack of intratile synchronization → preferable for Tile Spaces where the mapping direction is comparatively large. [Figure: two diagrams, one per grouping, each showing SMP0, SMP1, SMP2, SMP3 with CPU0 and CPU1.]

Experimental Platform: Linux SMP (Symmetric Multi-Processors) Cluster; 8 nodes; 128MB RAM; 2 Pentium III 800MHz; SCI ring (Dolphin's PCI-SCI D330 cards).

Initial Code
    /* func is not defined on the slide */
    for (i = 1; i <= X; i++)
        for (j = 1; j <= Y; j++)
            for (k = 1; k <= Z; k++) {
                A[i][j][k] = func(A[i-1][j][k], A[i][j-1][k], A[i][j][k-1]);
            }
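To make the role of the tile height concrete, here is a sequential sketch of the loop nest above executed in slabs of tile_height iterations along k. Several things are assumptions: that tiles are cut along k, the placeholder body of func, the small sizes X, Y, Z, the boundary value 1, and the function names. Distribution over processors and communication are not modelled.

```c
#include <assert.h>

#define X 4
#define Y 4
#define Z 16

/* Placeholder for the slide's unspecified func(). */
static long func(long a, long b, long c) { return (a + b + c) % 1000003; }

/* Fill the whole array with 1 so that index 0 in each dimension acts as
 * the boundary; interior points are overwritten by the loop nests below. */
static void init_all(long A[X + 1][Y + 1][Z + 1])
{
    for (int i = 0; i <= X; i++)
        for (int j = 0; j <= Y; j++)
            for (int k = 0; k <= Z; k++)
                A[i][j][k] = 1;
}

/* The loop nest in its original order. */
void run_initial(long A[X + 1][Y + 1][Z + 1])
{
    init_all(A);
    for (int i = 1; i <= X; i++)
        for (int j = 1; j <= Y; j++)
            for (int k = 1; k <= Z; k++)
                A[i][j][k] = func(A[i - 1][j][k], A[i][j - 1][k], A[i][j][k - 1]);
}

/* The same nest executed slab by slab along k. Each slab of tile_height
 * iterations is one tile; a smaller height means more, smaller tiles. */
void run_tiled(long A[X + 1][Y + 1][Z + 1], int tile_height)
{
    assert(tile_height > 0);
    init_all(A);
    for (int k0 = 1; k0 <= Z; k0 += tile_height) {
        int k_end = (k0 + tile_height - 1 < Z) ? k0 + tile_height - 1 : Z;
        for (int i = 1; i <= X; i++)
            for (int j = 1; j <= Y; j++)
                for (int k = k0; k <= k_end; k++)
                    A[i][j][k] = func(A[i - 1][j][k], A[i][j - 1][k], A[i][j][k - 1]);
    }
}
```

Both functions compute the same values, since all three dependence vectors are non-negative and the slab order respects them.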

Experimental results. [Figure: two plots of Time (sec) versus Tile Height, for Iteration Space 16x16x1024K and Iteration Space 48x48x512K. Curves: Non-overlapping scheme, vertical grouping; Overlapping scheme, vertical grouping; Non-overlapping scheme, hyperplane grouping; Overlapping scheme, hyperplane grouping.]

Grouping matrix = number of CPUs within an SMP node

Example. [Figure: Tile Space and Group Space, with SMP node0 and SMP node1.] Scheduling vector Π = (1,1).
