
A Study of Cyclops64 Crossbar Architecture and Performance Yingping Zhang April, 2005.




1 A Study of Cyclops64 Crossbar Architecture and Performance Yingping Zhang April, 2005

2 Overview
1. Background
2. Architecture Of C64 Crossbar
3. Performance Simulation
4. Test Result
5. Performance Analysis
6. Conclusion
7. Future Work

3 Background
1. What is Cyclops64?
- Cyclops64 (C64), also called Blue Gene/C, is part of the IBM Blue Gene project.
- It is a supercomputer based on a cellular architecture. Each chip consists of 75~80 custom-designed 64-bit processors. Each processor will have two thread units, two integer units, and a floating point unit.
- C64 is expected to reach 1000 teraflops and to be one of the fastest supercomputers in the world.
- The architecture was conceived by Cray award winner Monty Denneau. Verification testing and system software development are being done at our CAPSL group.
2. What is the project goal?
- Study the architecture and performance of the C64 interconnection network, the crossbar (part of verification testing).

4 Cyclops64 CHIP
[Figure: block diagram of the Cyclops64 chip. Labeled blocks: C64 Processor (TU, FP), ICache (labeled 5), Crossbar, Host IF, FIFO (64-bit x 64), Mickey tree and Gbit ethernet (each also shown with DMA), Disk, DDR2 SDRAM Controller (labeled 4), DDR2 SDRAM DIMMs, ASw (a part of the 3D cube network, leading to the other C64 chips), FPGA, Configuration Pin.]
Crossbar port assignment:
- Port 0-79 for C64 processors
- Port 80-83 for mpg ICache
- Port 84,85 for Host IF
- Port 86-89 for DRAM controller
- Port 90-95 for ASw
Processor# 80, ICache# 16
* The configuration pins are connected to all modules except DDR and Crossbar.
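The port assignment listed on this slide can be written down as a small lookup table. The sketch below is only an illustration: the function name `port_owner` and the table layout are hypothetical, while the port ranges and unit labels are copied from the slide.

```python
# Port ranges as listed on the slide; the names of the table and the
# function are illustrative, not taken from the C64 tools.
PORT_MAP = [
    (range(0, 80), "C64 processor"),
    (range(80, 84), "mpg ICache"),
    (range(84, 86), "Host IF"),
    (range(86, 90), "DRAM controller"),
    (range(90, 96), "ASw"),
]


def port_owner(port: int) -> str:
    """Return the unit type attached to a crossbar port (0..95)."""
    for ports, owner in PORT_MAP:
        if port in ports:
            return owner
    raise ValueError(f"port {port} is outside the 96-way crossbar")


if __name__ == "__main__":
    for p in (0, 79, 80, 85, 89, 95):
        print(p, port_owner(p))
```

The five ranges together cover exactly ports 0 through 95, which matches the 96-way crossbar.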

5 Architecture Of C64 Crossbar
1. On-chip crossbar: provides communication inside a single chip.
2. 96-way crossbar: 96 input ports and 96 output ports. Each port can connect to any other port and to itself. Any communication among processors, ICaches, SRAM, DRAM, and ASwitches has to go through the crossbar.
3. Pipelined crossbar: 7 pipeline stages. When fully pipelined, each port outputs one packet per cycle.
Bandwidth of the crossbar = number of ports * packet length (per cycle)
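The bandwidth formula and the 7-stage pipeline can be turned into a short worked calculation. This is a sketch under stated assumptions: the slide gives the port count (96) and the pipeline depth (7) but not the packet length, so `PACKET_BITS` below is a hypothetical value chosen only to make the arithmetic concrete.

```python
def peak_bandwidth_bits_per_cycle(num_ports: int, packet_bits: int) -> int:
    """Peak aggregate bandwidth when every port emits one packet per cycle."""
    return num_ports * packet_bits


def cycles_to_deliver(num_packets: int, stages: int = 7) -> int:
    """Cycles for one port to deliver a back-to-back stream of packets.

    The first packet needs `stages` cycles to cross the pipeline; with a
    full pipeline, each later packet leaves one cycle after the previous one.
    """
    if num_packets <= 0:
        return 0
    return stages + (num_packets - 1)


NUM_PORTS = 96     # from the slide
PACKET_BITS = 64   # hypothetical packet length, not given on the slide

if __name__ == "__main__":
    print(peak_bandwidth_bits_per_cycle(NUM_PORTS, PACKET_BITS), "bits per cycle")
    print(cycles_to_deliver(1), "cycles for a single packet")
    print(cycles_to_deliver(100), "cycles for 100 back-to-back packets")
```

With a deep enough stream of packets the 7-cycle fill time is amortized, which is why the per-port rate approaches one packet per cycle.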

6 Crossbar Architecture
[Figure: crossbar datapath drawn for several ports, up to Port# 96. Labeled blocks: SrcSplit, TarCombine, TUnitA, TUnitB, TarCtl, Arbiter, LC, SrcCtl, FIFO, MUX, C64|MP|CORE. Labeled signals: Ws, Wr, Rs, Req, Ack, Sel. Numeric wire labels as extracted: 102+2, 102, 9210, 9, 96, 92, 3, 95.]

7 Crossbar Architecture
[Figure: the same crossbar datapath, up to Port# 96, with the pipeline stages numbered 1 to 7. Labeled blocks: SrcSplit, TarCombine, TUnitA, TUnitB, TarCtl, Arbiter, LC, SrcCtl, FIFO, MUX, C64|MP|CORE. Labeled signals: Ws, Wr, Rs, Req, Ack, Sel. Numeric wire labels as extracted: 102+2, 102, 9210, 9, 96, 92, 3, 95.]

8 Performance Simulation
1. Performance Measurement
- Latency: the time required for a packet to traverse the network from source to destination
- Throughput: the rate at which packets are delivered by the network for a particular traffic pattern
2. Workloads
- Synthetic: Random Distributed vs Poisson Distributed
- Application Driven: Hello_World, Matrix_Cthread, Laplace_Cthread, Heat_Cthread, Cnet_get_nb, Cnet_put_nb, Dev_Align, Dev_Reset
3. Simulators
- Csim_crossbar
- LAST
(Both designed by Fei Chen at CAPSL)
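The two metrics defined on this slide can be computed from a per-packet log. The sketch below is not the output format of Csim_crossbar or LAST; the `PacketRecord` type is hypothetical, and normalizing throughput to packets per port per cycle is an assumption made so that a fully loaded port scores 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PacketRecord:
    """One delivered packet (hypothetical log format)."""
    inject_cycle: int   # cycle the packet entered the network
    deliver_cycle: int  # cycle the packet reached its destination


def average_latency(records: List[PacketRecord]) -> float:
    """Mean number of cycles between injection and delivery."""
    return sum(r.deliver_cycle - r.inject_cycle for r in records) / len(records)


def throughput(records: List[PacketRecord], num_ports: int, total_cycles: int) -> float:
    """Delivered packets per port per cycle (assumed normalization)."""
    return len(records) / (num_ports * total_cycles)


if __name__ == "__main__":
    log = [PacketRecord(0, 7), PacketRecord(1, 8), PacketRecord(2, 12)]
    print(average_latency(log))
    print(throughput(log, num_ports=3, total_cycles=10))
```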

9 Parameters configuration
PARAMETERS
- Workloads
  - Synthetic
    - Temporal characteristics (1): Uniform Random, Poisson
    - Spatial distributions (2): Uniform Random, Permutation (Neighbor & Tornado)
  - Application Driven Benchmarks
- Arbitration Schemes: Uniformly Random, Matrix, Circular, Segmented Matrix, Fixed Priority
(1) Describes the generation probability of messages over time.
(2) Determines the communication paths between the sources and destinations.
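The spatial distributions named on this slide decide which destination each source sends to. The sketch below uses the common textbook definitions of the neighbor and tornado permutations; whether the study's simulators used exactly these formulas is an assumption, and the function names are illustrative.

```python
import random


def uniform_random_dest(src: int, n: int, rng: random.Random) -> int:
    """Every output port is equally likely, independent of the source."""
    return rng.randrange(n)


def neighbor_dest(src: int, n: int) -> int:
    """Neighbor permutation: each source sends to the next port."""
    return (src + 1) % n


def tornado_dest(src: int, n: int) -> int:
    """Tornado permutation: each source sends almost halfway around,
    using the textbook offset ceil(n / 2) - 1."""
    return (src + (n + 1) // 2 - 1) % n


def is_permutation(dests, n: int) -> bool:
    """True when every output port is targeted exactly once."""
    return sorted(dests) == list(range(n))


if __name__ == "__main__":
    n = 96
    rng = random.Random(0)
    print(is_permutation([neighbor_dest(s, n) for s in range(n)], n))
    print(is_permutation([tornado_dest(s, n) for s in range(n)], n))
    print(is_permutation([uniform_random_dest(s, n, rng) for s in range(n)], n))
```

Because a permutation maps each source to a distinct output, no two packets injected in the same cycle compete for the same port.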

10 Test Results: Latency - Synthetic Workloads
- Latency of the Uniform Random pattern grows without bound when the injection rate exceeds 0.6.
- Latency of Permutation Traffic stays constant at 7 cycles.

11 Test Results: Throughput - Synthetic Workloads (Cont)
- A uniform workload with the permutation traffic pattern has linear throughput.
- This network is a stable network.

12 Test Results: Contention - Synthetic Workloads (Cont)
- Permutation traffic has zero contention.
- The uniform distribution has more contention than the Poisson distribution.

13 Performance Analysis One - Synthetic Workloads
- The minimum latency in the crossbar is 7 cycles.
- The crossbar is a stable network because its throughput does not degrade beyond the saturation point.
- Contention at the output ports delays message transfer, and permutation traffic has zero contention.
- A uniformly random workload with permutation traffic has the best performance. When the injection rate reaches 1.0, its throughput reaches 1.
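The link between output contention and traffic pattern can be illustrated with a toy model. This is not the Csim_crossbar or LAST simulator: it has no pipeline and no queues, it assumes each output accepts one packet per cycle, and it simply counts packets that would lose arbitration in a cycle. All names are hypothetical.

```python
import random
from collections import Counter


def count_conflicts(dests) -> int:
    """Packets that cannot be granted when each output takes one per cycle."""
    return sum(c - 1 for c in Counter(dests).values())


def neighbor_dest(src, n, rng):
    """Permutation traffic: source s always targets port s + 1."""
    return (src + 1) % n


def uniform_dest(src, n, rng):
    """Uniform random traffic: any output with equal probability."""
    return rng.randrange(n)


def simulate(n_ports, cycles, rate, pick_dest, seed=0) -> int:
    """Total conflicting packets over `cycles`, injecting with probability `rate`."""
    rng = random.Random(seed)
    lost = 0
    for _ in range(cycles):
        dests = [pick_dest(s, n_ports, rng)
                 for s in range(n_ports) if rng.random() < rate]
        lost += count_conflicts(dests)
    return lost


if __name__ == "__main__":
    print("permutation:", simulate(96, 1000, 1.0, neighbor_dest))
    print("uniform random:", simulate(96, 1000, 1.0, uniform_dest))
```

Even in this simplified model the permutation pattern never produces a conflict, while uniform random destinations collide in nearly every cycle at full injection rate.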

14 Test Results: Latency - Arbitration Schemes
- The Fixed Priority scheme is the worst case: its latency grows without bound at rate 0.5.
- The other schemes have very similar latency behavior.

15 Test Results: Throughput - Arbitration Schemes (Cont)
- The Fixed Priority scheme is the worst case: the network saturates at rate 0.5.
- The other schemes have very similar throughput behavior.

16 Performance Analysis Two - Arbitration Schemes
- The SLRU, PLRU, CIRC and RAND arbitration schemes show very similar performance behavior under the uniformly random traffic pattern.
- The Fixed Priority arbitration scheme shows the worst performance behavior under the same conditions.
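One way to see why a fixed priority arbiter can behave worst is to compare how it distributes grants with a rotating arbiter. The sketch below is a generic illustration, not the C64 hardware logic: the round-robin arbiter stands in for a circular scheme, and whether it matches the CIRC scheme in the study is an assumption.

```python
from typing import List, Optional


def fixed_priority_grant(requests: List[bool]) -> Optional[int]:
    """Always grant the lowest-numbered requesting input."""
    for i, wants in enumerate(requests):
        if wants:
            return i
    return None


class RoundRobinArbiter:
    """Generic rotating arbiter: the search starts just after the last winner."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.start = 0

    def grant(self, requests: List[bool]) -> Optional[int]:
        for offset in range(self.n):
            i = (self.start + offset) % self.n
            if requests[i]:
                self.start = (i + 1) % self.n
                return i
        return None


def grant_counts(grant_fn, n: int, cycles: int) -> List[int]:
    """Grants per input when every input requests in every cycle."""
    counts = [0] * n
    for _ in range(cycles):
        winner = grant_fn([True] * n)
        counts[winner] += 1
    return counts


if __name__ == "__main__":
    print("fixed priority:", grant_counts(fixed_priority_grant, 4, 8))
    print("round robin:   ", grant_counts(RoundRobinArbiter(4).grant, 4, 8))
```

Under sustained contention the fixed priority arbiter serves only the first input, so packets from the other inputs wait indefinitely, while the rotating arbiter shares the output evenly.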

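17 Test Results - Application-Driven Benchmarks
Application | Number Of Packets | Forward Latency (Avg) | Reverse Latency (Avg) | Forward Throughput (Avg) | Reverse Throughput (Avg)
Hello_World | 5110 | 7.35 | 19.74 | 0.002 | -
Heat_Cthread | 797586346.00 [packet count and forward latency run together] | 4034.00 | 0.002 | 0.001
Matrix_Cthread | 11021821.59939.00 [packet count and both latencies run together] | 0.002 | -
Cnet_get_nb | 10162 | 7.538 | 53.552 | 0.001 | 0.002
Cnet_put_nb | 10052 | 7.619 | 50.027 | 0.001 | 0.002
Dev_Align | 8924 | 7.286 | 37.381 | 0.002 | -
Dev_Reset | 10148 | 7.617 | 50.413 | 0.001 | 0.002
(A dash marks a row where the transcript lists only one throughput value. Digit runs in brackets are kept as extracted because the cell boundaries cannot be recovered.)
- Average reverse latency rises very quickly as the number of packets increases.
- Forward and reverse traffic have different latency behavior.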

18 Performance Analysis - Application-Driven Benchmarks
- The C64 architecture classifies traffic into:
  - Class 0 (forward traffic): messages sent out from processors, such as load requests and stores
  - Class 1 (reverse traffic): messages sent back to processors, such as load returns
- Reverse transfer delay is much larger than forward transfer delay.
- Forward and reverse transfers have similar throughput.

19 Conclusion
For Synthetic Workloads
Verified:
- The C64 crossbar is a stable network.
- The minimum latency of the C64 crossbar is 7 cycles.
Discovered:
- The traffic pattern, including temporal characteristics and spatial distribution, strongly affects the crossbar's performance behavior.
- Permutation spatial traffic has the best latency behavior. It stays at the minimum latency of 7 cycles because it has zero contention.
- The uniform random distributed workload has better throughput behavior.
- The fixed priority arbitration scheme has the worst performance behavior; the others are very similar.
For Application-Driven Workload
Discovered:
- Forward and reverse traffic have different latency behavior but similar throughput behavior.
- Reverse traffic has worse latency behavior than forward traffic.

20 Future Work
Synthetic Workloads
- Investigate arbitration schemes under different traffic patterns
Application-Driven Workloads
- Investigate performance behavior of the C64 crossbar under different configuration constraints
  - Number of used thread units
  - Number of involved memory banks
- Investigate performance behavior of the C64 crossbar under different arbitration schemes
Summary of Performance Analyses
Documentation

21 Acknowledgments
Fei Chen, Yuhei, Dimitri, Joseph, Ted, Prof. Gao, and all people in the CAPSL group

22 Questions? Thanks!

