Presentation transcript:

Eidgenössische Technische Hochschule Zürich / École polytechnique fédérale de Zurich / Politecnico federale di Zurigo / Swiss Federal Institute of Technology Zurich

25th Annual International Symposium on Computer Architecture
7th Workshop on Scalable Shared Memory Multiprocessors

Memory System Performance of High End SMPs, PCs and Clusters of PCs

Ch. Kurmann, T. Stricker
Laboratory for Computer Systems
ETHZ - Swiss Federal Institute of Technology, CH-8092 Zurich

Color Slides:

2 Memory Systems

- Low End designs in PCs:
  - extremely low cost
  - standard I/O interface
- High End designs in "Killer" Workstations:
  - well-engineered memory systems
  - support for additional data streams
  - better I/O busses
- Are Low End SMPs the universal compute nodes for parallel and distributed systems?

3 Contribution

- The answer is probably the memory system performance.
- How significant are the differences in memory system performance?
- Limitations of Low End memory systems:
  - for local computation (e.g. in scientific applications)
  - for inter-node communication (e.g. in databases)

4 Extended Copy Transfer Characterization

ECT is a method to characterize the performance of memory systems (ISCA95 and HPCA97):
- Categories:
  - access pattern, stride (spatial locality)
  - working set (temporal locality)
- Value:
  - transfer bandwidth (large amount of data)
- The same chart results from one microbenchmark:
  - local and remote transfers
  - compute and communicate accesses
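As an illustration, here is a minimal sketch of such a load microbenchmark in C. It is hypothetical code under simple assumptions (clock()-based timing, a fixed repeat count), not the authors' ECT implementation; it sweeps the working set and the load stride and reports the resulting load bandwidth, which is the parameter space of the charts on the following slides.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Time strided 64-bit loads over a given working set and return MB/s.
       Summing into 'sum' keeps the compiler from removing the loads. */
    static double measure_load_bw(size_t ws_bytes, size_t stride)
    {
        size_t nwords = ws_bytes / sizeof(long long);
        long long *buf = malloc(ws_bytes);
        long long sum = 0;
        volatile long long sink;
        const int repeats = 100;   /* raise for small working sets if the
                                      clock() resolution is too coarse */
        if (!buf || stride == 0 || nwords == 0)
            return 0.0;

        for (size_t i = 0; i < nwords; i++)    /* warm up caches / TLB */
            buf[i] = (long long)i;

        clock_t t0 = clock();
        for (int r = 0; r < repeats; r++)
            for (size_t i = 0; i < nwords; i += stride)
                sum += buf[i];
        clock_t t1 = clock();

        sink = sum; (void)sink;                /* defeat dead-code elimination */
        free(buf);

        double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        size_t loads = (size_t)repeats * ((nwords + stride - 1) / stride);
        return (double)loads * sizeof(long long) / secs / 1e6;
    }

    int main(void)
    {
        /* Sweep working set (temporal locality) and stride (spatial
           locality), roughly the 0.5 KB .. 8 MB range of the charts. */
        for (size_t ws = 512; ws <= ((size_t)8 << 20); ws *= 2)
            for (size_t stride = 1; stride <= 128; stride *= 2)
                printf("ws=%8zu B  stride=%3zu words: %9.1f MB/s\n",
                       ws, stride, measure_load_bw(ws, stride));
        return 0;
    }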

5 Measurement Problems

Some parameter combinations are hard to measure, even with carefully tuned C code:
- Reduced performance for large strides and small working sets in L1 caches is a measurement artifact, not architecture related.
- Compilers occasionally generate suboptimal instruction schedules for loads / stores.
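One kind of hand tuning this alludes to (my illustration, not the authors' code) is manual unrolling: giving the compiler several independent load streams to schedule, so that the measured bandwidth reflects the memory system rather than a poor instruction schedule. A sketch of the inner measurement loop above, unrolled four ways:

    /* Four independent accumulators let the loads overlap regardless of
       how conservatively the compiler schedules them. */
    static long long strided_sum_unrolled(const long long *buf,
                                          size_t nwords, size_t stride)
    {
        long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0, step = 4 * stride;

        for (; i + step <= nwords; i += step) {
            s0 += buf[i];
            s1 += buf[i + stride];
            s2 += buf[i + 2 * stride];
            s3 += buf[i + 3 * stride];
        }
        for (; i < nwords; i += stride)    /* leftover elements */
            s0 += buf[i];
        return s0 + s1 + s2 + s3;
    }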

6 Local Load Access: Pentium Pro PC

[Chart: load bandwidth vs. working set and access pattern (stride between 64-bit words)]

7 Local Load Access: SGI Origin

[Chart: load bandwidth vs. working set and access pattern (stride between 64-bit words)]

8 Local Load Access: DEC 8400

[Chart: load bandwidth vs. working set and access pattern (stride between 64-bit words)]

9 Local Load Access: Sun Enterprise

[Chart: load bandwidth (MByte/s) vs. working set (0.5 KB to 8 MB) and access pattern (stride between 64-bit words); Sun Ultra Enterprise, one UltraSPARC II at 248 MHz; L1, L2 and DRAM regions marked]

10 Local Load Access: SGI Cray T3E

[Chart: load bandwidth (MByte/s) vs. working set (0.5 KB to 8 MB) and access pattern (stride between 64-bit words); Cray T3E, one processor at 300 MHz; L1, L2 and DRAM regions marked]

11 Comparison - Local Access

12 Performance in an SMP setting

- Copy bandwidth decreases for simultaneous access with 1, 2, 4 and 8 processors.
- Topics of interest:
  - small working sets in caches: performance remains the same
  - large working sets in memory: interesting differences
  - behavior for even/uneven strides
- "Gather copy stream" (strided load / contiguous store); a sketch of this access pattern follows below.
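The "gather copy stream" named above is a copy kernel whose loads are strided while its stores stay contiguous. A minimal sketch (illustrative naming, not the benchmark's actual code):

    /* Copy n words from src to dst: strided loads, contiguous stores.
       src must hold at least n * stride words. */
    static void gather_copy(long long *dst, const long long *src,
                            size_t n, size_t stride)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i * stride];
    }

Run concurrently by 1, 2, 4 and 8 processors, this kernel loads the shared memory system and exposes the scaling differences the following slides compare.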

13 Local Copy: Pentium Pro SMP

14 Local Copy: SGI Origin CC-NUMA

15 Local Copy: DEC 8400 SMP

16 Local Copy: Sun Enterprise SMP

17 Remote Transfers in Parallel Computers

[Diagram: two organizations side by side. Parallel and network computers (SGI Cray T3E, SGI Origin, Clusters of PCs (CoPs)): each node couples processor (P), cache (C) and memory (M), and nodes communicate over a network. Symmetric multiprocessors (DEC 8400, Sun Enterprise, Pentium Pro SMPs): processors with caches share one memory over a bus/network.]

18 Remote Transfers: CoPs Pentium Pro with SCI / Myrinet

19 Remote Transfers: SGI Origin

20 Remote Transfers: DEC 8400

21 Remote Transfers: SGI Cray T3E

22 Comparison - Remote Transfers

23 Improvement of PC Chipsets

- Intel 440 BX AGP Chip Set: 400 MHz / 100 MHz
- Intel 440 LX AGP Chip Set: 233 MHz / 66 MHz
- Intel 440 FX Natoma Chip Set: 200 MHz / 66 MHz

(CPU clock / memory bus clock)
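To see why the second figure, the memory bus clock, is the one that matters here, a back-of-the-envelope calculation (assuming the 64-bit data path these chipsets use):

    peak bus bandwidth = bus width x bus clock
    440BX:    8 bytes x 100 MHz = 800 MB/s
    440LX/FX: 8 bytes x  66 MHz = 528 MB/s

So the 440 BX raises the ceiling on main memory bandwidth by roughly half over its predecessors, independent of the CPU clock.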

24 Conclusion

- ECT characterizations for different memory systems:
  - T3E (MPP node), Origin (NUMA), DEC 8400 (SMP)
  - CoPs: Intel P6 SMPs and clusters
- High End SMP vs. Low End SMP:
  - less than half the performance on two-processor PCs
- Fast communication puts high demands on the memory system:
  - unlike in traditional SMPs and CC-NUMAs, fine-grained remote accesses do not perform at all in PC SMPs and CoPs
- Adding more commodity microprocessors without reinforcing the memory system is therefore questionable.