CS 240A Applied Parallel Computing John R. Gilbert Thanks to Kathy Yelick and Jim Demmel at UCB for.


CS 240A Applied Parallel Computing John R. Gilbert Thanks to Kathy Yelick and Jim Demmel at UCB for some of their slides.

Course bureaucracy
- Read the course home page.
- Join the Google discussion group (see course home page).
- Accounts on Triton, San Diego Supercomputer Center: run "ssh-keygen -t rsa" and then email your "id_rsa.pub" file to Stefan Boeriu.
- If you weren't signed up for the course as of last week, email me your registration info right away.
- Triton logon demo and tool intro coming soon; watch the Google group for details.

Homework 1
See course home page for details. Find an application of parallel computing and build a web page describing it.
- Choose something from your research area, or from the web or elsewhere.
- Create a web page describing the application:
  - Describe the application and provide a reference (or link).
  - Describe the platform where this application was run.
  - Find peak and LINPACK performance for the platform and its rank on the TOP500 list.
  - Find the performance of your selected application.
  - What ratio of sustained to peak performance is reported?
  - Evaluate the project: How did the application scale, i.e., was speed roughly proportional to the number of processors? What were the major difficulties in obtaining good performance? What tools and algorithms were used?
- Send us (John and Hans) the link (we will post them).
- Due next Monday, April 5.
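The two quantities the homework asks about can be computed as in the minimal sketch below. The function names and the sample numbers are hypothetical, chosen only for illustration; "efficiency" here is the usual speedup divided by processor count, which is one way to judge whether speed was roughly proportional to the number of processors.

```python
def sustained_to_peak(sustained_flops, peak_flops):
    """Fraction of the platform's peak rate that the application achieved."""
    return sustained_flops / peak_flops


def parallel_efficiency(time_1_proc, time_p_procs, p):
    """Speedup on p processors divided by p; 1.0 means perfect scaling."""
    speedup = time_1_proc / time_p_procs
    return speedup / p


# Hypothetical numbers, not taken from any real machine or application.
ratio = sustained_to_peak(0.5e12, 1.0e12)      # half of peak
eff = parallel_efficiency(100.0, 12.5, 8)      # 8x speedup on 8 processors
print(ratio, eff)
```

An efficiency that stays near 1.0 as p grows indicates that the application scales well; a falling value points to communication or load-balance costs.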

Two examples of big parallel problems
- Bone density modeling: physical simulation; lots of numerical computing; spatially local.
- Vertex betweenness centrality: exploring an unstructured graph; lots of pointer-chasing; little numerical computing; no spatial locality.

Social network analysis
- Betweenness Centrality (BC), C_B(v): Among all the shortest paths, what fraction of them pass through the node of interest?
- Brandes' algorithm
[Figure: a typical software stack for an application enabled with the Combinatorial BLAS]
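As a point of reference for the definition above, here is a minimal serial sketch of Brandes' algorithm for an unweighted graph. It is not the Combinatorial BLAS code from the slides; the adjacency-dictionary input format and the function name are assumptions made for this example.

```python
from collections import deque


def betweenness_centrality(adj):
    """Brandes' algorithm on an unweighted graph.

    adj maps each vertex to a list of its neighbours. Scores are
    unnormalized and count ordered (source, target) pairs, so halve
    them for an undirected graph.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward phase: BFS from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:              # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward phase: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


# Path graph a - b - c: every shortest path between a and c passes through b.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness_centrality(graph))  # {'a': 0.0, 'b': 2.0, 'c': 0.0}
```

Each source vertex costs one breadth-first search plus one backward sweep, which is why the parallel versions discussed next concentrate on making many breadth-first searches fast.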

Betweenness Centrality using Sparse GEMM
- Parallel breadth-first search is implemented with sparse matrix-matrix multiplication.
- Work-efficient algorithm for BC.
[Figure: the sparse matrix product A^T X, followed by (A^T X) .* ¬X]
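The expression (A^T X) .* ¬X can be illustrated with a single frontier vector instead of a whole matrix X of frontiers. The sketch below is a simplification under stated assumptions: it works over the Boolean semiring (it only tracks reachability, not shortest-path counts), it stores the sparse matrix as a dictionary of row index to set of column indices, and the names `spmv_transpose_bool` and `bfs_levels` are hypothetical.

```python
def spmv_transpose_bool(A, x):
    """y = A^T x over the Boolean (OR, AND) semiring.

    A[i] is the set of columns j with A[i][j] = 1 (edge i -> j);
    x is the set of indices where the sparse vector is nonzero.
    """
    y = set()
    for i in x:
        y |= A.get(i, set())
    return y


def bfs_levels(A, source):
    """Breadth-first search written as repeated sparse products."""
    visited = {source}
    frontier = {source}
    levels = {source: 0}
    depth = 0
    while frontier:
        depth += 1
        # (A^T x) .* (not visited): keep only newly discovered vertices.
        frontier = spmv_transpose_bool(A, frontier) - visited
        for v in frontier:
            levels[v] = depth
        visited |= frontier
    return levels


# Small directed graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
A = {0: {1, 2}, 1: {3}, 2: {3}, 3: set()}
print(bfs_levels(A, 0))
```

Stacking many such frontier vectors as the columns of X turns each step into one sparse matrix-matrix multiplication, which is the batched form the slide refers to.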

BC performance in distributed memory
- TEPS: Traversed Edges Per Second
- Batch of 512 vertices at each iteration
- Code only a few lines longer than Matlab version
- Input: RMAT scale N (2^N vertices), average degree 8
- Pure MPI-1 version. No reliance on any particular hardware.

Parallel Computers Today
- Oak Ridge / Cray Jaguar: > 1.75 PFLOPS
- Two Nvidia 8800 GPUs: > 1 TFLOPS
- Intel 80-core chip: > 1 TFLOPS
- TFLOPS = 10^12 floating point ops/sec
- PFLOPS = 1,000,000,000,000,000 floating point ops/sec (10^15)

AMD Opteron quad-core die

Cray XMT (highly multithreaded shared memory)

Top 500 List

The Computer Architecture Challenge
- Most high-performance computer designs allocate resources to optimize Gaussian elimination on large, dense matrices.
- Originally, because linear algebra is the middleware of scientific computing.
- Nowadays, mostly for bragging rights.
[Figure: P A = L U]
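For readers who have not seen it written out, Gaussian elimination with partial pivoting produces the factorization P A = L U. The following is a small unblocked sketch on lists of lists, meant only to show the computation being optimized; it is not how LINPACK or any tuned library implements it, and the function name is an assumption.

```python
def lu_partial_pivot(A):
    """Factor a square matrix so that P A = L U.

    Returns (perm, L, U), where row i of P A is row perm[i] of A,
    L is unit lower triangular and U is upper triangular.
    """
    n = len(A)
    U = [row[:] for row in A]              # work on a copy of A
    L = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(U[i][k]))
        if U[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            U[k], U[p] = U[p], U[k]
            L[k], L[p] = L[p], L[k]        # swap multipliers found so far
            perm[k], perm[p] = perm[p], perm[k]
        L[k][k] = 1.0
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]          # elimination multiplier
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return perm, L, U


perm, L, U = lu_partial_pivot([[1.0, 2.0], [3.0, 4.0]])
print(perm)  # [1, 0]: the row with the larger pivot is moved to the top
```

The triple loop performs on the order of n^3 floating point operations on n^2 data, which is why dense LU rewards machines built for high arithmetic throughput.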

Why are powerful computers parallel?

Technology Trends: Microprocessor Capacity
- Moore's Law: #transistors/chip doubles every 1.5 years.
- Microprocessors have become smaller, denser, and more powerful.
- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra

Scaling microprocessors
- What happens when feature size shrinks by a factor of x?
- Clock rate used to go up by x, but no longer. Clock rates are topping out due to power (heat) limits.
- Transistors per unit area goes up by x^2.
- Die size also tends to increase, typically by another factor of ~x.
- Raw computing capability of the chip goes up by ~x^3!
- But it's all for parallelism, not speed.
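The x^3 figure is simply the product of the two factors listed above, with the clock rate held constant. A minimal sketch of that arithmetic follows; the function name is hypothetical.

```python
def capability_gain(x):
    """Growth in raw chip capability when feature size shrinks by x.

    Assumes, as the slide does, that clock rate no longer scales.
    """
    density_gain = x ** 2    # transistors per unit area
    die_area_gain = x        # die size, "typically another factor of ~x"
    return density_gain * die_area_gain


print(capability_gain(2))  # 8: halving feature size gives ~8x the transistors
```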

How fast can a serial computer be?
- Consider the 1 Tflop sequential machine.
- Data must travel some distance, r, to get from memory to CPU.
- To get 1 data element per cycle, this means 10^12 times per second at the speed of light, c = 3e8 m/s, so r < c/10^12 = 0.3 mm.
- Now put 1 TB of storage in a 0.3 mm x 0.3 mm area: each word occupies about 3 Angstroms x 3 Angstroms, the size of a small atom.
[Figure: a 1 Tflop, 1 TB sequential machine with r = 0.3 mm]
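The numbers in this argument can be checked with a few lines of arithmetic. This sketch assumes, as the slide implicitly does, that 1 TB is treated as 10^12 words laid out in a single flat square; the variable names are illustrative.

```python
C = 3.0e8            # speed of light, m/s
RATE = 1.0e12        # one data element per cycle at 1 Tflop/s
WORDS = 1.0e12       # 1 TB, counted as 10^12 words (assumption)
ANGSTROM = 1.0e-10   # metres per Angstrom

# Farthest a signal can travel in one cycle.
r_max = C / RATE                      # metres

# Side of the square in Angstroms, and area available per word.
side_angstrom = r_max / ANGSTROM
area_per_word = side_angstrom ** 2 / WORDS

print(r_max * 1e3, "mm")
print(area_per_word, "square Angstroms per word")
```

The result is roughly 0.3 mm for r and about 9 square Angstroms per word, i.e. a square about 3 Angstroms on a side.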

“Automatic” Parallelism in Modern Machines
- Bit-level parallelism: within floating point operations, etc.
- Instruction-level parallelism: multiple instructions execute per clock cycle.
- Memory system parallelism: overlap of memory operations with computation.
- OS parallelism: multiple jobs run in parallel on commodity SMPs.
There are limits to all of these. For very high performance, the user must identify, schedule and coordinate parallel tasks.

Number of transistors per processor chip

[Figure labels: Bit-Level Parallelism, Instruction-Level Parallelism, Thread-Level Parallelism?]

Generic Parallel Machine Architecture
- Key architecture question: Where is the interconnect, and how fast?
- Key algorithm question: Where is the data?
[Figure: storage hierarchy for several processors, each with Proc, Cache, L2 Cache, L3 Cache, and Memory levels, with potential interconnects between them]