
Parallel Programming on the SGI Origin2000 With thanks to Igor Zacharov / Benoit Marchand, SGI Taub Computer Center Technion Moshe Goldberg,


1 Parallel Programming on the SGI Origin2000 With thanks to Igor Zacharov / Benoit Marchand, SGI Taub Computer Center Technion Moshe Goldberg, mgold@tx.technion.ac.il Mar 2004 (v1.2)

2 Parallel Programming on the SGI Origin2000
1) Parallelization Concepts
2) SGI Computer Design
3) Efficient Scalar Design
4) Parallel Programming - OpenMP
5) Parallel Programming - MPI

3 2) SGI Computer Design

4 Origin2000/3000 architecture features
Important hardware and software components:
* node board: processors + memory
* node interconnect topology and configurations
* scalability of the architecture
* directory-based cache coherency
* single system image components

5 Origin2000 node board

6 Origin node board
HUB crossbar ASIC:
- Single chip integrates all four functions:
  * processor interface: two rxK processors on the same bus
  * memory interface, integrating the memory controller and (directory) cache coherency
  * interface to the CrayLink Interconnect to other nodes in the system
  * interface to I/O devices with XIO-to-PCI bridges
- Memory access characteristics:
  * read bandwidth, single processor: 460 MB/s sustained
  * average access latency: 315 ns to restart the processor pipeline
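
As a rough illustration of how the two memory figures on this slide combine, the sketch below estimates a read time as one access latency plus the transfer time at the sustained bandwidth. The cost model, the function name `estimated_read_time`, and the reading of 1 MB as 10^6 bytes are all assumptions made for illustration; none of them comes from the slides.

```python
# Rough cost model (an assumption, not from the slides):
#   time = one access latency + bytes / sustained bandwidth
READ_BANDWIDTH_BYTES_PER_S = 460e6   # 460 MB/s, taking 1 MB = 10**6 bytes (assumption)
ACCESS_LATENCY_S = 315e-9            # 315 ns average access latency


def estimated_read_time(num_bytes):
    """Return the estimated number of seconds needed to read num_bytes."""
    if num_bytes < 0:
        raise ValueError("num_bytes must be non-negative")
    return ACCESS_LATENCY_S + num_bytes / READ_BANDWIDTH_BYTES_PER_S


if __name__ == "__main__":
    # Compare a small read with two larger streaming reads.
    for size in (128, 1 << 20, 100 * 10**6):
        micros = estimated_read_time(size) * 1e6
        print(f"{size:>10d} bytes: {micros:.3f} microseconds")
```

Under this model small reads are dominated by the 315 ns latency and large streaming reads by the 460 MB/s bandwidth, which is why contiguous access matters on this machine.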

7 Origin2000 node components

8 Origin router interconnect
- Router chip has 6 CrayLink interfaces: 2 for connections to nodes (HUBs) and 4 for connections to other routers in the network
  * 4-dimensional interconnect
- The interconnect topology is determined by the size of the computer (number of nodes):
  * direct (back-to-back) connection for 2 nodes (4 cpu)
  * strongly connected cube for up to 32 cpu
  * hypercube for up to 64 cpu
  * hypercube of hypercubes for up to 256 cpu
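
The router counts behind these configurations can be sketched as follows. The code assumes two processors per node and two nodes per router, as stated on the slides, and labels routers with binary numbers so that hypercube neighbors differ in exactly one bit. That labeling and all function names are illustrative assumptions. The sketch covers only the cube and hypercube cases (up to 64 cpu); it ignores the router-less two-node case and the hypercube-of-hypercubes configuration.

```python
CPUS_PER_NODE = 2      # two processors per HUB (node board slide)
NODES_PER_ROUTER = 2   # 2 of the 6 router links connect to nodes


def routers_needed(cpus):
    """Number of routers needed to attach the given number of CPUs."""
    cpus_per_router = CPUS_PER_NODE * NODES_PER_ROUTER
    return -(-cpus // cpus_per_router)   # ceiling division


def hypercube_dimension(routers):
    """Smallest d such that a d-dimensional hypercube has >= routers corners."""
    d = 0
    while (1 << d) < routers:
        d += 1
    return d


def hypercube_neighbors(router_id, dimension):
    """Routers whose binary label differs from router_id in exactly one bit."""
    return [router_id ^ (1 << bit) for bit in range(dimension)]


if __name__ == "__main__":
    for cpus in (32, 64):
        routers = routers_needed(cpus)
        dim = hypercube_dimension(routers)
        print(cpus, "cpu:", routers, "routers, dimension", dim)
    print("neighbors of router 5 in 4 dimensions:", hypercube_neighbors(5, 4))
```

With 64 cpu the 16 routers form a 4-dimensional hypercube, which uses all four router-to-router links; with 32 cpu the 8 routers form a 3-dimensional cube and leave one link per router free.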

9 Origin2000 – two nodes

10 Origin2000 module connections

11 Origin2000 interconnect

12 32 processors 64 processors

13 Origin2000 interconnect

14 Directory-based uniform cache Cache line use is recorded in directory, which resides in memory

15 Origin cache coherence
- Memory page is divided into data blocks of 32 words or 128 bytes each (L2 cache line size)
- Each data request transfers one data block (128 bytes)
- Each data block has associated presence and state information
[Figure: in memory, each directory entry holds a presence field (64 bits) and a state field (3 bits) for one data block (cache line) of 128 bytes (32 words)]
- If a node (HUB) requests a data block, the corresponding presence bit is set and the state of that cache line is recorded
- HUB runs the cache coherence protocol, updating the state of the data block and notifying nodes for which the presence bit is set
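
The bookkeeping described on this slide can be modelled in a few lines. This is a simplified sketch, not the actual Origin protocol: the slide only says that the state field is 3 bits wide, so the three state names below are hypothetical, and so are the class and method names. The 128-byte block size and the 64 presence bits are taken from the slide.

```python
BLOCK_BYTES = 128      # L2 cache line size, from the slide
PRESENCE_BITS = 64     # one presence bit per node, from the slide

# Hypothetical state names; the slide does not list the real states.
UNOWNED, SHARED, EXCLUSIVE = 0, 1, 2


def block_index(address):
    """Index of the 128-byte data block that contains a byte address."""
    return address // BLOCK_BYTES


class DirectoryEntry:
    """Presence bit vector plus state for one data block."""

    def __init__(self):
        self.presence = 0
        self.state = UNOWNED

    def sharers(self):
        """Nodes whose presence bit is currently set."""
        return [n for n in range(PRESENCE_BITS) if (self.presence >> n) & 1]

    def read(self, node):
        """Node requests the block for reading; return the nodes to notify."""
        notify = []
        if self.state == EXCLUSIVE:
            notify = [n for n in self.sharers() if n != node]
            if not notify:
                return notify          # requester already owns the block
        self.presence |= 1 << node     # set the requester's presence bit
        self.state = SHARED
        return notify

    def write(self, node):
        """Node requests exclusive access; return the nodes to invalidate."""
        notify = [n for n in self.sharers() if n != node]
        self.presence = 1 << node      # only the writer keeps a copy
        self.state = EXCLUSIVE
        return notify


if __name__ == "__main__":
    entry = DirectoryEntry()
    entry.read(3)
    entry.read(5)
    print("sharers after two reads:", entry.sharers())
    print("nodes notified on write by node 7:", entry.write(7))
```

The point of the presence vector is that a write only notifies the nodes that actually hold a copy, instead of broadcasting to every node in the system.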

16 Origin address space
- Physically the memory is distributed and not contiguous
- Node id is assigned at boot time
- Logically memory is a single shared contiguous address space; the virtual address space is 44 bits (16 TB)
- A program (compiler) uses the virtual address space
- CPU translates from virtual to physical address space
[Figure: physical address layout: node id, 8 bits (bits 39-32); node offset, 32 bits (bits 31-0, 4 GB)]
[Figure: the TLB (Translation Look-aside Buffer) maps virtual pages to physical memory on nodes 0, 1, 2, 3, ...; legend: Node id, Empty slot, Memory present, page]
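
The physical address layout on this slide (an 8-bit node id in bits 39-32 and a 32-bit node offset in bits 31-0) can be decoded with plain shifts and masks. The sketch below is only an illustration of that bit layout; the function names are assumptions, and it does not model the virtual-to-physical translation done by the TLB.

```python
NODE_ID_BITS = 8            # bits 39..32 of the physical address
NODE_OFFSET_BITS = 32       # bits 31..0, i.e. 4 GB per node
VIRTUAL_ADDRESS_BITS = 44   # 16 TB virtual address space


def split_physical(address):
    """Split a 40-bit physical address into (node_id, node_offset)."""
    if not 0 <= address < (1 << (NODE_ID_BITS + NODE_OFFSET_BITS)):
        raise ValueError("address does not fit in 40 bits")
    node_id = address >> NODE_OFFSET_BITS
    offset = address & ((1 << NODE_OFFSET_BITS) - 1)
    return node_id, offset


def join_physical(node_id, offset):
    """Build a physical address from a node id and an offset within the node."""
    if not 0 <= node_id < (1 << NODE_ID_BITS):
        raise ValueError("node id does not fit in 8 bits")
    if not 0 <= offset < (1 << NODE_OFFSET_BITS):
        raise ValueError("offset does not fit in 32 bits")
    return (node_id << NODE_OFFSET_BITS) | offset


if __name__ == "__main__":
    node, offset = split_physical(0x0300001000)
    print("node", node, "offset", hex(offset))
    print("virtual space in TB:", (1 << VIRTUAL_ADDRESS_BITS) // 2**40)
    print("per-node space in GB:", (1 << NODE_OFFSET_BITS) // 2**30)
```

The same arithmetic confirms the sizes quoted on the slide: 2^44 bytes is 16 TB and 2^32 bytes is 4 GB.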

17 Summary: Origin2000 properties
- Single machine image
  * behaves like a large workstation
  * same compilers
  * time sharing
  * all old SGI code (binaries) will run
  * OS schedules the hardware resources on the machine
- Processor scalability: 2-1024 cpu
- I/O scalability
- All memory and I/O devices are directly addressable
  * no limitations on the size of a single program; it can use all available memory
  * no limitations on the location of the data; all disks can be used in a single file system
- 64-bit operating system and file system
  * HPC features: checkpoint/restart, queueing system
- Machine stability

18 Origin2000/3000 architecture goal
Hardware design: distributed memory
But to a programmer: it looks like shared memory

19 Example: Simple Memory Access

20

21

22

23 Parix run limits
(1) NQS queues on parix
(2) Interactive: maximum cputime = 15 minutes

24 Two ways to run a batch job
(1) Parameters in command line
(2) Parameters in script file

25 QSUB options

26 Output of command: "qstat -a"

27 Exercise 1 – login and submit a job

28

