Presentation transcript:

CS 240A Applied Parallel Computing John R. Gilbert Thanks to Kathy Yelick and Jim Demmel at UCB for some of their slides.

Why are we here?
- Computational science: the world's largest computers have always been used for simulation and data analysis in science and engineering.
- Performance: getting the most computation for the least cost (in time, hardware, or energy).
- Architectures: all big computers (and most little ones) are parallel.
- Algorithms: the building blocks of computation.

Course bureaucracy: Read the course home page on GauchoSpace. Accounts on Triton/TSCC, San Diego Supercomputer Center: run "ssh-keygen -t rsa" and then send your PUBLIC key file "id_rsa.pub" to Kadir Diri. Triton logon demo & tool intro coming soon. Watch (and participate in) the "Discussions, questions, and announcements" forum on the GauchoSpace page.

Homework 1: Two parts
Part A: Find an application of parallel computing and build a web page describing it. Choose something from your research area, or from the web. Describe the application and provide a reference. Describe the platform where this application was run. Evaluate the project. Send us (John and Veronika) the link -- we will post them.
Part B: Performance tuning exercise. Make my matrix multiplication code run faster on 1 processor! See the GauchoSpace page for details (a baseline sketch appears below).
Both parts are due next Tuesday, January 14.
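For reference, a minimal sketch of the kind of naive triple-loop baseline such a tuning exercise usually starts from (in C; the function name and row-major square-matrix layout are assumptions, not the actual course code):

```c
/* Naive matrix multiply: C = C + A*B, all n-by-n, row-major.
   Illustrative baseline only; the real homework code may differ. */
void square_dgemm_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double cij = C[i*n + j];
            for (int k = 0; k < n; k++)
                cij += A[i*n + k] * B[k*n + j];   /* dot product of row i and column j */
            C[i*n + j] = cij;
        }
}
```

Typical single-processor speedups come from loop reordering, cache blocking (tiling), and using SIMD, rather than from changing the arithmetic.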

Trends in parallelism and data. [Chart: number of Facebook users grew from 50 million to 500 million; 16x.] More cores and more data → need to extract algorithmic parallelism.

Parallel Computers Today: Intel 61-core Phi chip, 1.2 TFLOPS; Oak Ridge / Cray Titan, 17 PFLOPS; Nvidia GTX GPU, 1.5 TFLOPS. TFLOPS = 10^12 floating point ops/sec; PFLOPS = 10^15 floating point ops/sec.

Supercomputers 1976: Cray-1, 133 MFLOPS (10^6).

Technology Trends: Microprocessor Capacity. Moore's Law: the number of transistors per chip doubles every 1.5 years. Microprocessors have become smaller, denser, and more powerful. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. (Slide source: Jack Dongarra)
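Stated as a formula (a plain restatement of the doubling claim, not from the slides): if a chip holds N_0 transistors today, then after t years

```latex
N(t) \approx N_0 \cdot 2^{\,t/1.5},
\qquad\text{e.g. } N(15) \approx 2^{10} N_0 \approx 1000\, N_0 .
```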

“Automatic” Parallelism in Modern Machines
- Bit-level parallelism: within floating point operations, etc.
- Instruction-level parallelism: multiple instructions execute per clock cycle.
- Memory system parallelism: overlap of memory operations with computation.
- OS parallelism: multiple jobs run in parallel on commodity SMPs.
There are limits to all of these -- for very high performance, the user must identify, schedule, and coordinate parallel tasks.

Number of transistors per processor chip

Bit-Level Parallelism Instruction-Level Parallelism Thread-Level Parallelism?

Trends in processor clock speed

Generic Parallel Machine Architecture. Key architecture question: Where is the interconnect, and how fast? Key algorithm question: Where is the data? [Diagram: several processors, each with its own cache, L2 cache, L3 cache, and memory forming a storage hierarchy, joined by potential interconnects at several levels.]

AMD Opteron 12-core chip (e.g. LBL’s Cray XE6 “Hopper”)

Triton memory hierarchy I (Chip level): [Diagram: 8 cores, each with its own L1 and L2 cache, sharing an 8 MB L3 cache.] Chip (AMD Opteron 8-core Magny-Cours). The chip sits in a socket, connected to the rest of the node...

Triton memory hierarchy II (Node level): [Diagram: a node with 64 GB of shared memory and four chips; each chip holds cores with L1/L2 caches and an 8 MB L3 cache.]

Triton memory hierarchy III (System level): 324 nodes (64 GB each), message-passing communication, no shared memory between nodes.

One kind of big parallel application Example: Bone density modeling Physical simulation Lots of numerical computing Spatially local See Mark Adams’s slides…

“The unreasonable effectiveness of mathematics.” As the “middleware” of scientific computing, linear algebra has supplied or enabled:
- Mathematical tools
- “Impedance match” to computer operations
- High-level primitives
- High-quality software libraries
- Ways to extract performance from computer architecture
- Interactive environments
[Diagram: continuous physical modeling → linear algebra → computers]

Top 500 List (November 2013). [Figure: P·A = L·U] Top500 Benchmark: Solve a large system of linear equations by Gaussian elimination.
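Written out, the factorization in the figure is standard Gaussian elimination with partial pivoting (textbook linear algebra, not specific to these slides): factor once, then solve with two triangular substitutions.

```latex
PA = LU, \qquad Ax = b \;\Longrightarrow\;
Ly = Pb \ \text{(forward substitution)}, \quad
Ux = y \ \text{(back substitution)}.
```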

Large graphs are everywhere… [Images: WWW snapshot, courtesy Y. Hyun; yeast protein interaction network, courtesy H. Jeong.] Internet structure, social interactions, scientific datasets: biological, chemical, cosmological, ecological, …

Another kind of big parallel application Example: Vertex betweenness centrality Exploring an unstructured graph Lots of pointer-chasing Little numerical computing No spatial locality See Eric Robinson’s slides…

Social network analysis. Betweenness Centrality (BC), C_B(v): among all the shortest paths, what fraction of them pass through the node of interest? Brandes' algorithm. [Figure: a typical software stack for an application enabled with the Combinatorial BLAS.]
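The standard definition behind that question (textbook formula, not copied from the slides): with \sigma_{st} the number of shortest paths from s to t and \sigma_{st}(v) the number of those that pass through v,

```latex
C_B(v) \;=\; \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}} .
```

Brandes' algorithm accumulates these fractions using one breadth-first (or shortest-path) traversal per source vertex, avoiding an explicit sum over all pairs.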

An analogy? [Diagram: continuous physical modeling → linear algebra → computers; discrete structure analysis → graph theory → computers.]

Node-to-node searches in graphs … Who are my friends’ friends? How many hops from A to B? (six degrees of Kevin Bacon) What’s the shortest route to Las Vegas? Am I related to Abraham Lincoln? Who likes the same movies I do, and what other movies do they like?... See breadth-first search example slides
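All of these questions reduce to breadth-first search from a starting vertex. Below is a minimal sketch of a sequential, level-by-level BFS over a CSR (compressed sparse row) adjacency structure, in C; the struct layout and names are illustrative assumptions, not code from the course.

```c
#include <stdlib.h>

/* Graph in CSR form: the neighbors of vertex v are
   adj[row_ptr[v] .. row_ptr[v+1]-1].  Illustrative layout. */
typedef struct {
    int n;          /* number of vertices       */
    int *row_ptr;   /* length n+1               */
    int *adj;       /* length = number of edges */
} Graph;

/* Fill dist[v] with the hop count from source s (-1 if unreachable). */
void bfs(const Graph *g, int s, int *dist)
{
    int *queue = malloc(g->n * sizeof(int));
    int head = 0, tail = 0;

    for (int v = 0; v < g->n; v++)
        dist[v] = -1;
    dist[s] = 0;
    queue[tail++] = s;

    while (head < tail) {
        int u = queue[head++];                     /* dequeue next frontier vertex */
        for (int e = g->row_ptr[u]; e < g->row_ptr[u + 1]; e++) {
            int w = g->adj[e];
            if (dist[w] == -1) {                   /* first time w is reached */
                dist[w] = dist[u] + 1;
                queue[tail++] = w;
            }
        }
    }
    free(queue);
}
```

For “how many hops from A to B?” one simply reads dist[B] after calling bfs with source A; parallel BFS implementations typically process each frontier level concurrently instead of one vertex at a time.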

Graph 500 List (November 2013). Graph500 Benchmark: Breadth-first search in a large power-law graph.

Floating-Point vs. Graphs, November 2013. [Figures: P·A = L·U vs. a graph traversal.] 33.8 Peta / 15.3 Tera is about 2,200 (33.8 Petaflops vs. 15.3 Terateps).

Floating-Point vs. Graphs, November 2013. [Figures: P·A = L·U vs. a graph traversal.] Nov 2013: 33.8 Peta / 15.3 Tera ~ 2,200. Nov 2010: 2.5 Peta / 6.6 Giga ~ 380,000. (33.8 Petaflops vs. 15.3 Terateps.)
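The ratios spelled out (simple arithmetic on the numbers above): the gap between floating-point and graph-traversal performance shrank by roughly two orders of magnitude in three years.

```latex
\frac{33.8 \times 10^{15}}{15.3 \times 10^{12}} \approx 2.2 \times 10^{3},
\qquad
\frac{2.5 \times 10^{15}}{6.6 \times 10^{9}} \approx 3.8 \times 10^{5}.
```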
