Blue Gene / C Cellular architecture: 64-bit Cyclops64 chip

Presentation transcript:

Blue Gene / C
Cellular architecture
64-bit Cyclops64 chip:
– 500 MHz
– 80 processors (each has 2 thread units and an FP unit)
Software
– Cyclops64 exposes much of the underlying hardware to the programmer, allowing the programmer to write very high-performance, finely tuned software.

The C64 system is a petaflop supercomputer built on multi-core system-on-a-chip (SoC) technology, based on a cellular architecture and expected to achieve over one petaflop of peak performance. A maximum configuration of a C64 system consists of 13,824 C64 processing nodes (over one million processors) connected by a 3D-mesh network. Each node is composed of a C64 chip, external DRAMs and a small number of external modules. A C64 chip consists of up to 80 custom-designed 64-bit processors (each consisting of two thread processing cores), 16 shared instruction caches (I-caches), 160 on-chip embedded SRAM memory banks and 80 floating-point units (FPUs). It is interesting to note that there is no data cache on the chip. Instead, each SRAM bank on the chip can be configured into two levels: global interleaved memory banks (GM), which are uniformly addressable, and scratchpad memories (SP), which are local to individual processors. The C64 chip configuration used in this study integrates 75 processors on a single chip. Each processor contains two thread units, one floating-point unit and two 32 KB SRAM memory banks. Groups of five processors share one I-cache.
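As a rough cross-check of the petaflop claim, the sketch below multiplies out the figures quoted above for a maximal 13,824-node system. The assumption that each FPU retires one fused multiply-add (two flops) per cycle is ours, not stated in the slides.

```python
# Back-of-the-envelope component counts and peak performance for a maximal
# C64 system, using the figures quoted above. FLOPS_PER_FPU_PER_CYCLE = 2
# (one fused multiply-add per cycle) is an assumption, not a slide figure.

CLOCK_HZ = 500e6                 # 500 MHz
PROCESSORS_PER_CHIP = 80         # custom 64-bit processors per C64 chip
THREAD_UNITS_PER_PROC = 2
FPUS_PER_CHIP = 80               # one FP unit per processor
NODES = 13_824                   # maximum 3D-mesh configuration
FLOPS_PER_FPU_PER_CYCLE = 2      # assumed: one fused multiply-add per cycle

processors_total = NODES * PROCESSORS_PER_CHIP
thread_units_total = processors_total * THREAD_UNITS_PER_PROC
peak_flops = NODES * FPUS_PER_CHIP * CLOCK_HZ * FLOPS_PER_FPU_PER_CYCLE

print(f"processors   : {processors_total:,}")            # ~1.1 million
print(f"thread units : {thread_units_total:,}")           # ~2.2 million
print(f"peak         : {peak_flops / 1e15:.2f} PFLOPS")   # ~1.1 PFLOPS
```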

IBM Cyclops Project

Interconnection Network System = Processor Tiles + Channels + Routers

Router Architecture: input-queued, virtual-channel, speculative pipeline

Switches
– Low-swing bit lines
– Operate at channel rate
– Reduces area and hence power
– Equalized drive
– Buffered crosspoints
– Integral allocation

Torus

Concentrated Mesh Source: Balfour and Dally, ICS 06

Express Links Source: Balfour and Dally, ICS 06

The most important quality measures of an interconnection network are:
1. Degree – the maximum degree over all PUs;
2. Diameter – the maximum distance between any pair of PUs in the network;
3. Bisection width – the minimum number of connections that must be removed in order to decompose a network of n PUs into two networks with at most ⌈n/2⌉ PUs each.
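To illustrate these metrics, the following sketch (not from the slides) builds a small k x k 2D torus and measures its degree and diameter with plain breadth-first search; bisection width is a minimum-cut quantity, so the known analytic value for an even torus is quoted in a comment rather than computed.

```python
# Minimal sketch: degree and diameter of a k x k 2D torus via BFS.
from collections import deque

def torus_2d(k):
    """Adjacency list of a k x k 2D torus; node (x, y) maps to index x*k + y."""
    adj = {n: [] for n in range(k * k)}
    for x in range(k):
        for y in range(k):
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                adj[x * k + y].append(((x + dx) % k) * k + (y + dy) % k)
    return adj

def diameter(adj):
    """Maximum BFS eccentricity over all source nodes."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

k = 8
adj = torus_2d(k)
print("degree   :", max(len(v) for v in adj.values()))  # 4
print("diameter :", diameter(adj))                      # 2 * (k // 2) = 8
# Bisection width of an even k x k torus is 2k (16 here): halving the torus
# cuts k mesh links plus k wrap-around links.
```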

Comparison of the diameter (D) and average diameter (Dm) of toruses, fat-trees and circulant graphs (Project SWISS):
1. Toruses always have the worst diameter.
2. Fat-trees appear to have the best diameter, but the difference with circulant graphs decreases with increasing degree.
3. The average diameter of fat-trees is very close to their diameter; as a consequence, for degrees greater than 4 and sizes smaller than 1000, the average diameter of circulant graphs is smaller than that of fat-trees.
4. For up to 1000 PUs, the diameter of circulant graphs is smaller than or equivalent to that of fat-trees as soon as the degree is greater than 6.
5. Fat-trees always have the best bisection width, toruses the worst, and the bisection width of circulant graphs is very erratic.

Comparison of the bisection width of toruses, fat-trees and circulant graphs. Based on these results we can discard the toruses, which always have the worst diameter and bisection width. Small-degree fat-trees seem to be the best choice, even if the difference with circulant graphs is not spectacular. Nevertheless, the drawback of fat-trees is that they are extremely rigid. We have the following properties:
– The number of fat-trees of a given degree d and size N is smaller than N; for d = 8 and N = 1000 this number is equal to 3.
– Well-performing circulant graphs can be found for any number of PUs.

Comparison of the bisection width of toruses, fat-trees and circulant graphs

Building up systems with several hundred blocks requires building a matrix of high-speed, high-fanout fat-tree switches to interconnect the processors. Courtesy Compaq Computer Corporation, Manchester, U.K.

To understand how technology changes affect the optimal network radix, consider the latency (T) of a packet traveling through a network. The header latency (T_h) is the time for the beginning of a packet to traverse the network and is equal to the number of hops a packet takes times a per-hop router delay (t_r). Since packets are generally wider than the network channels, the body of the packet must be squeezed across the channel, incurring an additional serialization delay (T_s). Thus, total delay can be written as

T = T_h + T_s = H·t_r + L/b   (1)

where H is the number of hops a packet travels, L is the length of a packet, and b is the bandwidth of the channels. For an N-node network with radix-k routers (k input channels and k output channels per router), the number of hops must be at least 2·log_k N. Also, if the total bandwidth of a router is B, that bandwidth is divided among the 2k input and output channels, so b = B/2k.
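To make the model concrete, here is a minimal sketch of equation (1) under the assumptions stated above (H = 2·log_k N hops and b = B/2k per-channel bandwidth). The numeric parameter values are illustrative, not taken from the slides.

```python
# Sketch of the packet-latency model T = H*t_r + L/b from equation (1).
import math

def packet_latency(N, k, B, t_r, L):
    """Header latency plus serialization latency for a radix-k network."""
    H = 2 * math.log(N, k)   # hops through a radix-k network of N nodes
    b = B / (2 * k)          # router bandwidth split over 2k channels
    return H * t_r + L / b

# Example (illustrative values): 4096 nodes, radix-32 routers,
# 2.4 Tb/s of router bandwidth, 20 ns per hop, 512-bit packets.
print(packet_latency(N=4096, k=32, B=2.4e12, t_r=20e-9, L=512))
```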

Substituting this into the expression for latency from equation (1):

T = 2·t_r·log_k N + 2kL/B   (2)

Then, setting dT/dk equal to zero and isolating k gives the optimal radix in terms of the network parameters:

k·log²k = B·t_r·log N / L   (3)

Router delay t_r can be expressed as the number of pipeline stages (P) times the cycle time (t_cy). As radix increases, t_cy remains constant and P increases logarithmically. The number of pipeline stages P can be further broken down into a component that is independent of the radix (X) and a component that is dependent on the radix (Y·log₂k). Thus router delay (t_r) can be rewritten as

t_r = t_cy·P = t_cy·(X + Y·log₂k)   (4)
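The sketch below cross-checks equation (3): brute-force minimization of T(k) from equation (2) should land close to the radix satisfying the stationarity condition, which is exact when natural logarithms are used. All parameter values are illustrative assumptions.

```python
# Cross-check of the optimal-radix condition: with natural logs,
# dT/dk = 0 gives k * (ln k)^2 = B * t_r * ln(N) / L, the same shape
# as equation (3).
import math

def total_latency(k, N, B, t_r, L):
    """Equation (2): T = 2*t_r*log_k(N) + 2*k*L/B."""
    return 2 * t_r * math.log(N, k) + 2 * k * L / B

def optimal_radix(N, B, t_r, L, k_max=256):
    """Brute-force search for the integer radix minimizing T(k)."""
    return min(range(2, k_max + 1),
               key=lambda k: total_latency(k, N, B, t_r, L))

# Illustrative parameters: 4096 nodes, 2.4 Tb/s router bandwidth,
# 20 ns per hop, 512-bit packets.
N, B, t_r, L = 4096, 2.4e12, 20e-9, 512
k_best = optimal_radix(N, B, t_r, L)
lhs = k_best * math.log(k_best) ** 2        # k * (ln k)^2 at the optimum
rhs = B * t_r * math.log(N) / L             # B * t_r * ln(N) / L
print("brute-force optimum k :", k_best)    # roughly 50
print("k*(ln k)^2 at optimum :", round(lhs, 1))
print("B*t_r*ln(N)/L         :", round(rhs, 1))
```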

Radix Clos Rank-2 Network Latency
Latency = H·t_r + L/b = 2·t_r·log_k N + 2kL/B
where: k = radix, B = total router bandwidth, N = number of nodes, L = message size.

Chip radix switch latency

Radix Clos Rank 2 Network