CS 240A: Applied Parallel Computing
John R. Gilbert
Thanks to Kathy Yelick and Jim Demmel at UCB for some of their slides.

Course bureaucracy
- Read the course home page on GauchoSpace.
- Accounts on Triton, San Diego Supercomputing Center: run "ssh-keygen -t rsa" and then send your "id_rsa.pub" file to Stefan Boeriu. Triton logon demo & tool intro coming soon.
- Watch (and participate in) the "Discussions, questions, and announcements" forum on the GauchoSpace page.

Homework 1
See the GauchoSpace page for details.
Find an application of parallel computing and build a web page describing it. Choose something from your research area, or from the web or elsewhere.
- Describe the application and provide a reference (or link).
- Describe the platform where this application was run.
- Find peak and LINPACK performance for the platform and its rank on the TOP500 list.
- Find the performance of your selected application. What ratio of sustained to peak performance is reported?
- Evaluate the project: How did the application scale, i.e., was speed roughly proportional to the number of processors? (See the definitions below.) What were the major difficulties in obtaining good performance? What tools and algorithms were used?
Send us (John and Varad) the link -- we will post them.
Due next Monday, April 9.
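For the scaling question, the standard yardsticks (textbook definitions, not spelled out on the slide) are speedup and parallel efficiency, where $T_p$ is the running time on $p$ processors:

\[
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}.
\]

Speed "roughly proportional to the number of processors" means $S(p) \approx p$, i.e. efficiency stays near 1.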

Why are we here?
- Computational science: the world's largest computers have always been used for simulation and data analysis in science and engineering.
- Performance: getting the most computation for the least cost (in time, hardware, or energy).
- Architectures: all big computers (and most little ones) are parallel.
- Algorithms: the building blocks of computation.

Parallel Computers Today
- Oak Ridge / Cray Jaguar: > 1.75 PFLOPS
- Two Nvidia 8800 GPUs: > 1 TFLOPS
- Intel 80-core chip: > 1 TFLOPS
TFLOPS = 10^12 floating point ops/sec; PFLOPS = 10^15 (1,000,000,000,000,000) floating point ops/sec.

Supercomputers 1976: Cray-1, 133 MFLOPS (10^6)

Trends in processor clock speed

Trends in parallelism and data
[Figures: 16× growth in parallelism; number of Facebook users, growing from 50 million to 500 million.]
More cores and more data → need to extract algorithmic parallelism.

Generic Parallel Machine Architecture
Key architecture question: Where is the interconnect, and how fast?
Key algorithm question: Where is the data?
[Diagram: several processors, each with its own cache, L2 cache, L3 cache, and memory, forming a storage hierarchy; potential interconnects at several levels.]

AMD Opteron 12-core chip (e.g. LBL’s Cray XE6 “Hopper”)

4-core Intel Nehalem chip (2 per Triton node):

Triton memory hierarchy
[Diagram: each node contains two 4-core chips; each core has its own cache and L2 cache; the four cores on a chip share an L3 cache; both chips share the node's memory.]
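To make the hierarchy concrete, here is a small illustrative C sketch (mine, not from the slides) of why "where is the data?" matters: the two loops below do identical arithmetic, but the row-major traversal walks memory with unit stride and stays in cache, while the column-major traversal strides across it.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

int main(void) {
    /* One row-major N x N matrix, summed two ways. */
    double *a = malloc((size_t)N * N * sizeof *a);
    for (size_t i = 0; i < (size_t)N * N; i++) a[i] = 1.0;

    clock_t t0 = clock();
    double s1 = 0.0;                 /* row-major: unit stride, cache-friendly */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s1 += a[(size_t)i * N + j];
    clock_t t1 = clock();

    double s2 = 0.0;                 /* column-major: stride N, cache-hostile */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s2 += a[(size_t)i * N + j];
    clock_t t2 = clock();

    printf("row-major %.2fs, column-major %.2fs (sums %g %g)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    free(a);
    return 0;
}
```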

Triton overall architecture

One kind of big parallel application
Example: bone density modeling
- Physical simulation
- Lots of numerical computing
- Spatially local (a generic sketch of this access pattern follows)
See Mark Adams's slides…
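Physical simulations of this kind typically update each grid point from its nearest neighbors, which is exactly what makes them spatially local. A minimal sketch, assuming a generic 2D heat-style stencil (not Mark Adams's actual bone model):

```c
#define NX 512
#define NY 512

/* One Jacobi sweep of a 2D 5-point stencil: each interior point becomes
   the average of itself and its four neighbors. Every value a point
   needs lives nearby in the grid -- spatial locality. */
void stencil_sweep(const double u[NX][NY], double v[NX][NY]) {
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            v[i][j] = 0.2 * (u[i][j] + u[i-1][j] + u[i+1][j]
                                     + u[i][j-1] + u[i][j+1]);
}
```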

“The unreasonable effectiveness of mathematics”
[Diagram: continuous physical modeling → linear algebra → computers.]
As the “middleware” of scientific computing, linear algebra has supplied or enabled:
- Mathematical tools
- “Impedance match” to computer operations
- High-level primitives
- High-quality software libraries
- Ways to extract performance from computer architecture
- Interactive environments
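A concrete instance of such a high-level primitive (my illustration, assuming a CBLAS implementation such as OpenBLAS is linked in): one call to the BLAS matrix-multiply routine replaces a hand-written triple loop and lets the library extract the architecture's performance.

```c
#include <cblas.h>

/* C = A * B for 2x2 row-major matrices via the BLAS dgemm primitive.
   Link with a CBLAS implementation, e.g. -lopenblas. */
int main(void) {
    double A[4] = {1, 2, 3, 4};
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0, 0, 0, 0};
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,     /* m, n, k */
                1.0, A, 2,   /* alpha, A, lda */
                B, 2,        /* B, ldb */
                0.0, C, 2);  /* beta, C, ldc */
    return 0;
}
```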

Top 500 List (November 2011)
Top500 benchmark: solve a large dense system of linear equations by Gaussian elimination.
[Diagram: the factorization PA = LU.]
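Spelled out (standard linear algebra, not on the slide): Gaussian elimination with partial pivoting factors the matrix and then solves two triangular systems,

\[
PA = LU, \qquad Ax = b \;\Longleftrightarrow\; Ly = Pb,\; Ux = y,
\]

at a cost of about $\tfrac{2}{3}n^3$ floating-point operations, which is why the benchmark is scored in FLOPS.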

Large graphs are everywhere…
- Internet structure
- Social interactions
- Scientific datasets: biological, chemical, cosmological, ecological, …
[Images: WWW snapshot, courtesy Y. Hyun; yeast protein interaction network, courtesy H. Jeong.]

Another kind of big parallel application
Example: vertex betweenness centrality
- Exploring an unstructured graph
- Lots of pointer-chasing
- Little numerical computing
- No spatial locality
See Eric Robinson's slides…

Social network analysis
Betweenness centrality (BC): C_B(v) asks, among all the shortest paths, what fraction of them pass through the node of interest? Computed by Brandes' algorithm; the formula appears below.
[Figure: a typical software stack for an application enabled with the Combinatorial BLAS.]
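The standard definition (not written out on the slide): with $\sigma_{st}$ the number of shortest paths from $s$ to $t$, and $\sigma_{st}(v)$ the number of those that pass through $v$,

\[
C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}.
\]

Brandes' algorithm computes this for all $v$ in $O(nm)$ time on an unweighted graph, rather than enumerating all shortest paths explicitly.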

An analogy?
[Diagram: continuous physical modeling → linear algebra → computers, side by side with discrete structure analysis → graph theory → computers.]

Node-to-node searches in graphs…
- Who are my friends' friends?
- How many hops from A to B? (six degrees of Kevin Bacon)
- What's the shortest route to Las Vegas?
- Am I related to Abraham Lincoln?
- Who likes the same movies I do, and what other movies do they like?
- …
See breadth-first search example slides (a minimal sketch follows).
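As a stand-in for the example slides, a minimal breadth-first search sketch (my illustration: a small hard-coded adjacency-list graph, computing hop distance from a source vertex):

```c
#include <stdio.h>

#define N 6   /* vertices in the toy graph */

/* Compressed adjacency lists: the edges of vertex v are
   adj[off[v]] .. adj[off[v+1]-1]. The graph is undirected. */
static const int off[N + 1] = {0, 2, 5, 7, 9, 11, 12};
static const int adj[]      = {1, 2,  0, 2, 3,  0, 1,  1, 4,  3, 5,  4};

int main(void) {
    int dist[N], queue[N], head = 0, tail = 0;
    for (int v = 0; v < N; v++) dist[v] = -1;   /* -1 = not yet reached */

    int source = 0;
    dist[source] = 0;
    queue[tail++] = source;

    while (head < tail) {                       /* expand one frontier vertex */
        int v = queue[head++];
        for (int e = off[v]; e < off[v + 1]; e++) {
            int w = adj[e];
            if (dist[w] == -1) {                /* first visit: one more hop */
                dist[w] = dist[v] + 1;
                queue[tail++] = w;
            }
        }
    }
    for (int v = 0; v < N; v++)
        printf("hops from %d to %d: %d\n", source, v, dist[v]);
    return 0;
}
```

Note there is no floating-point work at all: the inner loop is pure pointer-chasing through the adjacency structure, which is why graph benchmarks are measured in traversed edges per second rather than FLOPS.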

Graph 500 List (November 2011)
Graph500 benchmark: breadth-first search in a large power-law graph.

Floating-Point vs. Graphs, November 2011
Top500 (dense LU, PA = LU): 10.5 Petaflops. Graph500 (breadth-first search): 254 Gigateps.
10.5 Peta / 254 Giga is about 41,000!
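The arithmetic behind the punchline (teps = traversed edges per second):

\[
\frac{10.5 \times 10^{15}\ \text{flops/s}}{254 \times 10^{9}\ \text{teps}} \approx 4.1 \times 10^{4},
\]

i.e. the best floating-point rate is roughly 41,000 times the best graph-traversal rate.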

An analogy? Well, we're not there yet…
[Diagram: discrete structure analysis → graph theory → computers.]
✓ Mathematical tools
? "Impedance match" to computer operations
? High-level primitives
? High-quality software libraries
? Ways to extract performance from computer architecture
? Interactive environments