The Cosmic Cube Charles L. Seitz Presented By: Jason D. Robey 2 APR 03

Agenda
– Introduction
– Message Passing
– Process Oriented
– Concurrency Paradigm
– Hardware Description
– Software Considerations
– Measurements
– Future Work
– Summary

Introduction
– How do we get a large number of processors to work together on the same problem in a scalable way?
– Test bed developed at Caltech for what the group hoped would become a VLSI implementation
– The programmer controls data sharing explicitly; there are no cache-coherency mechanisms
– For certain problems, the techniques give close to linear speed-up

Message Passing
– Communication and synchronization primitives are visible to the programmer (sketched below)
  – Barrier
  – Blocking sends and receives
  – Broadcasts and node-to-node message passing
  – Explicit sharing of data by sending messages
  – The programmer decides when updates are necessary
– The hardware structure is a memory/processor node
  – Memory and inter-process communication are considered separately, so each can be optimized
  – Memory sits close to where it will be used
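
For readers more used to modern tools: the Cosmic Cube predates MPI, but the primitive set on this slide (blocking send/receive, broadcast, barrier, explicit data sharing by messages) maps almost directly onto MPI's blocking operations. The small C/MPI program below is only an illustration of that primitive set, not the original Cosmic Cube C extensions or kernel API.

```c
#include <stdio.h>
#include <mpi.h>

/* Illustration of the slide's primitive set using MPI's blocking
 * operations.  Not the Cosmic Cube API; just the same ideas in a
 * form that compiles with any standard MPI installation.           */
int main(int argc, char **argv)
{
    int rank, size, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Node-to-node message: process 0 explicitly shares an update with process 1. */
    if (rank == 0) {
        value = 42;
        if (size > 1)
            MPI_Send(&value, 1, MPI_INT, 1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Broadcast: one copy of the data is pushed to every process.        */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Barrier: nobody proceeds until every process has reached this point. */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("process %d of %d sees value %d\n", rank, size, value);
    MPI_Finalize();
    return 0;
}
```

With a typical MPI installation this builds with mpicc and runs under mpirun with two or more processes.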

Message Passing
– Hyper-cube communications
  – Scales well: O(n lg n) total link cost, O(lg n) worst-case message delivery
  – Simple routing: node addresses are discrete, 2-valued n-tuples (n-bit binary addresses), and the destination address itself gives the routing instructions (see the sketch below)
  – Clustering: "spheres" of nodes can be devoted to separate problems
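
To make "the address gives the routing instructions" concrete, here is a dimension-order routing sketch in C: a message moves from source to destination by correcting one differing address bit per hop, so the hop count equals the number of differing bits, at most lg n. This is an illustration of the idea, not the Cosmic Cube's actual routing firmware.

```c
#include <stdio.h>

/* Dimension-order routing on a hypercube: flip one differing address
 * bit per hop, lowest dimension first.                               */
void print_route(unsigned src, unsigned dst, int dims)
{
    unsigned cur = src;
    printf("%u", cur);
    for (int d = 0; d < dims; d++) {
        unsigned bit = 1u << d;
        if ((cur ^ dst) & bit) {     /* addresses differ in dimension d */
            cur ^= bit;              /* hop across that dimension       */
            printf(" -> %u", cur);
        }
    }
    printf("\n");
}

int main(void)
{
    /* 6-cube (64 nodes): route from node 5 (000101) to node 46 (101110).
     * The addresses differ in 4 bits, so the message takes 4 hops.       */
    print_route(5, 46, 6);
    return 0;
}
```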

Process Oriented
– An abstraction away from direct hardware targeting
– Processes are mapped to nodes
  – Multiple processes can be interleaved on a single node
  – Each process has a unique address
  – Each process has unique message channels
  – The programmer is not concerned with the actual number of nodes or their addresses
– A kernel is required on each node
  – Provides routing services
  – Provides process management services
  – Consumes processing time

Process Oriented
– Caltech's implementation disallows moving a process between nodes
  – Prevents effective run-time load balancing, which becomes the programmer's responsibility
  – Allows the node ID to be included with the process ID, so routing can take advantage of the hyper-cube simplifications (see the sketch below)
– Issue: interleaving may be bad in certain cases
  – A context switch is needed for message passing
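
A minimal sketch of what "node ID included with the process ID" could look like. The slide does not specify a layout; the field widths and helper names below are assumptions made purely for illustration.

```c
#include <stdio.h>

/* Hypothetical process-id layout: the upper bits name the node and the
 * lower bits name the process within that node.  With this encoding the
 * routing layer needs only the node field; XOR of the source and
 * destination node fields gives the hypercube dimensions left to cross. */
#define LOCAL_BITS 8                                 /* processes per node */

static int make_pid(int node, int local) { return (node << LOCAL_BITS) | local; }
static int pid_node(int pid)             { return pid >> LOCAL_BITS; }
static int pid_local(int pid)            { return pid & ((1 << LOCAL_BITS) - 1); }

int main(void)
{
    int pid = make_pid(37, 2);   /* process 2 running on node 37 */
    printf("pid %d -> node %d, local process %d\n",
           pid, pid_node(pid), pid_local(pid));
    return 0;
}
```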

Concurrency Paradigm
– The programmer must deal with concurrency explicitly
– Different from approaches where the compiler or hardware is expected to find the parallelism
– Requires a restructuring of single-processor ideas
  – Bubble sort, for example, becomes a linear-time solution (see the sketch below)
  – Many solutions need to be redesigned altogether
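
One common way the bubble-sort restructuring is presented is odd-even transposition sort: with one element per process and neighboring processes exchanging values each round, n processes sort n keys in n rounds, i.e. linear parallel time. The slide does not name the algorithm, so treat this sequential C simulation as an illustrative sketch rather than Cosmic Cube code.

```c
#include <stdio.h>

/* Odd-even transposition sort, simulated sequentially.  Each inner
 * compare_exchange would be a neighbor message exchange on the cube;
 * all pairs in a round can run concurrently, giving n parallel rounds. */
static void compare_exchange(int *a, int i)
{
    if (a[i] > a[i + 1]) { int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t; }
}

int main(void)
{
    int a[] = { 7, 3, 9, 1, 4, 8, 2, 6 };
    int n = sizeof a / sizeof a[0];

    for (int round = 0; round < n; round++)          /* n parallel rounds      */
        for (int i = round % 2; i + 1 < n; i += 2)   /* disjoint neighbor pairs */
            compare_exchange(a, i);

    for (int i = 0; i < n; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```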

Concurrency Paradigm
– Techniques
  – Exploit the outer loop: unroll it across processes (example below)
  – Keep messaging sparse and predictable
– Good for science and engineering problems
  – Regular loops
  – Predictable control flow
  – SIMD style: the same operation applied to a large amount of data
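
A hedged illustration of the outer-loop technique, again in C/MPI rather than the original C extensions: each process takes a contiguous block of outer-loop iterations, works independently, and the results are combined with a single reduction, so communication stays sparse and predictable. The problem (a sum of squares) and the block-partitioning scheme are invented for the example.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int N = 1 << 20;                /* total outer-loop iterations */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = (N + size - 1) / size;    /* iterations per process      */
    int lo = rank * chunk;
    int hi = (lo + chunk < N) ? lo + chunk : N;

    double local = 0.0, total = 0.0;
    for (int i = lo; i < hi; i++)         /* regular, predictable work   */
        local += (double)i * i;

    /* One message per process per run: sparse, predictable communication. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of squares below %d = %.0f\n", N, total);

    MPI_Finalize();
    return 0;
}
```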

Hardware Description
– 64-node hyper-cube
  – 5 ft., 700 watts, $80,000
  – Linear projection
  – Simulation results led to the hyper-cube choice
  – Allowed for slow network links compared to CPU speed
– Node
  – 8086 processor with 8087 coprocessor
    – Needed good floating-point performance
    – Clock slowed from 8 MHz to 5 MHz to accommodate the 8087
  – 128K RAM (spend the money on other things)
  – 8K ROM for initialization and power-on self-tests (POSTs)

Hardware Description
– A prototype was developed first, as a test bed and as a way to raise resources for the full machine
  – 2-cube prototype, grown to a 6-cube by the summer of 1983
– First year of operation: 560,000 node-hours
  – 2 hard errors
  – About 1 soft error every several days

Software Considerations
– Development and testing were done on traditional machines
– Initialization had to deal with node checks in addition to RAM checks
– Extensions to C had to be developed so other researchers could use the machine
– A kernel had to be developed
  – Deals with the message-passing constructs
  – Manages requests from the intermediate host (IH)
  – probe: gives a process access to the message layer
  – spy: lets the IH examine and modify kernel execution data

Measurements
– Speedup = T(1) / T(N)
– Efficiency = Speedup / N
  – An efficiency of 1 is good
  – An efficiency <= 1/N is bad (no speedup at all)
– Only really useful for measuring the scalability of an algorithm on problems requiring far more processes than available nodes
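
A tiny worked example of the two formulas; the run times below are invented numbers for illustration, not measurements from the paper.

```c
#include <stdio.h>

/* Speedup = T(1)/T(N); Efficiency = Speedup/N.  Illustrative values only. */
int main(void)
{
    double t1 = 640.0;   /* run time on 1 node (seconds)  */
    double tn = 12.5;    /* run time on N nodes (seconds) */
    int    N  = 64;

    double speedup    = t1 / tn;        /* 51.2 here          */
    double efficiency = speedup / N;    /* 0.8 of ideal here  */

    printf("speedup    = %.1f (ideal %d)\n", speedup, N);
    printf("efficiency = %.2f (1.0 is ideal, 1/N means no gain)\n", efficiency);
    return 0;
}
```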

Measurements
– What affects efficiency? (Overhead)
  – Load-balancing problems
  – Message start-up latency (matters more for many small messages than for a few big ones)
  – Hop latency
  – Processor time used in message-routing functions
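
These overhead sources are often summarized in a first-order cost model, time = start-up + hops * per-hop + bytes / bandwidth. The constants in this sketch are placeholders, not measured Cosmic Cube figures.

```c
#include <stdio.h>

/* First-order message-cost model for a store-and-forward machine.
 * All constants below are placeholders chosen for illustration.     */
double message_time(int hops, int bytes)
{
    const double startup   = 300e-6;   /* software start-up latency (s)  */
    const double per_hop   = 50e-6;    /* per-hop forwarding cost (s)    */
    const double bandwidth = 250e3;    /* link bandwidth (bytes per s)   */

    return startup + hops * per_hop + (double)bytes / bandwidth;
}

int main(void)
{
    printf("1 KB over 3 hops: %.1f us\n", message_time(3, 1024) * 1e6);
    printf("8 B  over 3 hops: %.1f us\n", message_time(3, 8) * 1e6);
    return 0;
}
```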

Measurements
– Performance
  – Some applications achieved a maximum of 3 MIPS in floating-point operations
  – Many other applications reached optimal speed-up compared to a VAX-11/780 with overheads of
  – Low message frequency?

Future Work
– Move the routing functions into the network device
– Experiment with a hybrid shared-memory approach
– Allow for dynamic load balancing
– Experiment with more programmer control over process-to-node assignments
– Try different problem areas to expand the message protocol
– Make the interface more programmer-friendly

Summary
– A new programming paradigm is required
– Offers many advantages for the scientific and engineering problem set
– May be interesting to apply to other domains
– Achieved what appears to be excellent scalability
– Good success in a limited domain

Questions? Comments? Snide Remarks?