Initial Kernel Timing Using a Simple PIM Performance Model Daniel S. Katz 1*, Gary L. Block 1, Jay B. Brockman 2, David Callahan 3, Paul L. Springer 1,

Slides:



Advertisements
Similar presentations
Eric Gallery Manuel Mendez David A. Turner Arturo I Concepcion.
Advertisements

Chapter 13 Control Structures. Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display Control Structures Conditional.
XML: Managing Data Exchange Stylesheets. Lesson Contents CSS The basic XSL file XSL transforms Templates Sort Numbering Parameters and Variables Datatypes.
NewsFlash!! Earth Simulator no longer #1. In slightly less earthshaking news… Homework #1 due date postponed to 10/11.
Seminar On “ OMNET++ Network Simulator” Presented By: Saurav K Bengani Guided By: Guided By: Dr. Andrew yang Dr. Andrew yang.
Flowcharts Amir Haider Lecturer NFC IEFR. Introduction The flowchart is a means of visually presenting the flow of data through an information processing.
ATACC 17August, 2003 ATACCC St. Petersburg, FL August 17, 2003 Richard M. Satava, MD FACS Professor of Surgery University of Washington School of Medicine.
Threads - Definition - Advantages using Threads - User and Kernel Threads - Multithreading Models - Java and Solaris Threads - Examples - Definition -
Project Characterization Real Time Image Processing Presented by: Baruch Koren Shahaf Fisher Technion – Israel Institute Of Technology Electrical Engineering.
Parallel Prefix Sum (Scan) GPU Graphics Gary J. Katz University of Pennsylvania CIS 665 Adapted from articles taken from GPU Gems III.
February 21, 2008 Center for Hybrid and Embedded Software Systems Mapping A Timed Functional Specification to a Precision.
CS Distributed Computing Systems Chin-Chih Chang, An Introduction to Threads.
Tile Reduction: the first step towards tile aware parallelization in OpenMP Ge Gan Department of Electrical and Computer Engineering Univ. of Delaware.
Mstack on the Cray MTA-2. What is Mstack? (1) Written by Daniel M. Pressel of the US Army Research Lab Small, simple, lightweight, easily portable. Simulates.
PARALLEL PROGRAMMING ABSTRACTIONS 6/16/2010 Parallel Programming Abstractions 1.
XMT BOF SC09 XMT Status And Roadmap Shoaib Mufti Director Knowledge Management.
Upcrc.illinois.edu OpenMP Lab Introduction. Compiling for OpenMP Open project Properties dialog box Select OpenMP Support from C/C++ -> Language.
March 6, 2003Thomas Sterling - Caltech & JPL 1 Crystal Ball Panel Thomas Sterling California Institute of Technology and NASA Jet Propulsion Laboratory.
ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Novel Architectures Copyright 2004 Daniel J. Sorin Duke University.
Computer Architecture B-tree construction & Traversal YongSoo Bae Room 236, Engineering Building [Project 1]
A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block.
By Arun Bhandari Course: HPC Date: 01/28/12. GPU (Graphics Processing Unit) High performance many core processors Only used to accelerate certain parts.
Algorithm Engineering „GPGPU“ Stefan Edelkamp. Graphics Processing Units  GPGPU = (GP)²U General Purpose Programming on the GPU  „Parallelism for the.
Presented by High Productivity Language and Systems: Next Generation Petascale Programming Wael R. Elwasif, David E. Bernholdt, and Robert J. Harrison.
Presented by High Productivity Language Systems: Next-Generation Petascale Programming Aniruddha G. Shet, Wael R. Elwasif, David E. Bernholdt, and Robert.
POLS 2300: Introduction to Library Research Timothy Bristow Research & Instruction Librarian, Scott Library.
Performance of mathematical software Agner Fog Technical University of Denmark
FIGURE 11.1 Mapping between OpenCL and CUDA data parallelism model concepts. KIRK CH:11 “Programming Massively Parallel Processors: A Hands-on Approach.
October 12, 2004Thomas Sterling - Caltech & JPL 1 Roadmap and Change How Much and How Fast Thomas Sterling California Institute of Technology and NASA.
Evaluation of Modern Parallel Vector Architectures Leonid Oliker Future Technologies Group Computational Research Division LBNL
HPEC2002_Session1 1 DRM 11/11/2015 MIT Lincoln Laboratory Session 1: Novel Hardware Architectures David R. Martinez 24 September 2002 This work is sponsored.
Lesson 2-2 Example Use the Commutative and/or Associative Properties to find the sum mentally Step 1 Look for two numbers whose sum is.
Initial Kernel Timing Using a Simple PIM Performance Model Daniel S. Katz 1*, Gary L. Block 1, Jay B. Brockman 2, David Callahan 3, Paul L. Springer 1,
Block diagram reduction
CS533 Concepts of Operating Systems Jonathan Walpole.
SOS7 What will Cray do for Supercomputing in this Decade? Asaph Zemach Cray Inc.
Learners Support Publications Introduction to C++
HUMA 1970: Introduction to Library Research Timothy Bristow Research & Instruction Librarian, Scott Library.
Seminar On Rain Technology
National Aeronautics and Space Administration Jet Propulsion Laboratory, California Institute of Technology Bundle Protocol Status Scott Burleigh Jet Propulsion.
Processes Chapter 3. Processes in Distributed Systems Processes and threads –Introduction to threads –Distinction between threads and processes Threads.
SEMINAR TOPIC ON “RAIN TECHNOLOGY”
SUBJECT : DIGITAL ELECTRONICS CLASS : SEM 3(B) TOPIC : INTRODUCTION OF VHDL.
Chapter 4 – Thread Concepts
Topics to be covered Instruction Execution Characteristics
Bare metal OS project CSSE 332 Operating Systems
ECE 486/586 Computer Architecture Introductions Instructor and You
Closing Remarks and Action Items
Chapter 4 – Thread Concepts
Concept Map: Clustering Visualizations of Categorical Domains
Compiler Back End Panel
HITS Hypertext Induced Topic Selection
© 2012 Elsevier, Inc. All rights reserved.
Compiler Back End Panel
© 2012 Elsevier, Inc. All rights reserved.
Parallel Computation Patterns (Scan)
Propaganda.
HITS Hypertext Induced Topic Selection
A Comparison-FREE SORTING ALGORITHM ON CPUs
EE 4xx: Computer Architecture and Performance Programming
Topic 1: Be able to combine functions and determine the resulting function. Topic 2: Be able to find the product of functions and determine the resulting.
Introduction to CUDA.
The Purpose of this Course
© 2012 Elsevier, Inc. All rights reserved.
© 2012 Elsevier, Inc. All rights reserved.
Operating System Overview
What is a Thread? A thread is similar to a process, but it typically consists of just the flow of control. Multiple threads use the address space of a.
Оюутны эрдэм шинжилгээний хурлын зорилго:
Test Prep/ Writing Lesson
Presentation transcript:

Initial Kernel Timing Using a Simple PIM Performance Model Daniel S. Katz 1*, Gary L. Block 1, Jay B. Brockman 2, David Callahan 3, Paul L. Springer 1, Thomas Sterling 1,4 1 Jet Propulsion Laboratory, California Institute of Technology, USA 2 University of Notre Dame, USA 3 Cray Inc., USA 4 California Institute of Technology, USA * Technical Group Supervisor Parallel Applications Technologies Group Initial Kernel Timing Using a Simple PIM Performance Model This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under its Contract No. NBCH

Parallel Applications Technologies Group - Purpose of this Poster Discuss initial results of paper-and-pencil studies of 4 application kernels applied to a processor-in-memory (PIM) system roughly similar to the Cascade Lightweight Processor (LWP) Application kernels: Linked list traversal Vector sum Bitonic sort Intent of work is to guide and validate work on Cascade in the areas of compilers, simulators, and languages

Parallel Applications Technologies Group - Poster Topics Generic PIM structure Concepts needed to program a parallel PIM system Locality Threads Parcels Simple PIM performance model For each kernel: Code(s) for a single PIM node Code(s) for multiple PIM nodes that move data to threads Code(s) for multiple PIM nodes that move threads to data Hand-drafted timing forecasts, based on the simple PIM performance model Lessons learned What programming styles seem to work best Looking at both expressiveness and performance more expressivecloser to h/w Assembly This Code C C++ Matlab