Programming Multiprocessors with Explicitly Managed Memory Hierarchies ELEC 6200 Xin Jin 4/30/2010.

Slides:



Advertisements
Similar presentations
Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories Muthu Baskaran 1 Uday Bondhugula.
Advertisements

WATERLOO ELECTRICAL AND COMPUTER ENGINEERING 20s: Computer Hardware 1 WATERLOO ELECTRICAL AND COMPUTER ENGINEERING 20s Computer Hardware Department of.
Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.
CML Efficient & Effective Code Management for Software Managed Multicores CODES+ISSS 2013, Montreal, Canada Ke Bai, Jing Lu, Aviral Shrivastava, and Bryce.
Implementation of 2-D FFT on the Cell Broadband Engine Architecture William Lundgren Gedae), Kerry Barnes (Gedae), James Steed (Gedae)
Autonomic Systems Justin Moles, Winter 2006 Enabling autonomic behavior in systems software with hot swapping Paper by: J. Appavoo, et al. Presentation.
Department of Computer Science iGPU: Exception Support and Speculative Execution on GPUs Jaikrishnan Menon, Marc de Kruijf Karthikeyan Sankaralingam Vertical.
Ensuring Operating System Kernel Integrity with OSck By Owen S. Hofmann Alan M. Dunn Sangman Kim Indrajit Roy Emmett Witchel Kent State University College.
June 9, 2007 Animation of Important Concepts in Parallel Computer Architecture Gambhir, Gehringer & Solihin Animation of Important Concepts in Parallel.
GPGPU Introduction Alan Gray EPCC The University of Edinburgh.
The Stanford Directory Architecture for Shared Memory (DASH)* Presented by: Michael Bauer ECE 259/CPS 221 Spring Semester 2008 Dr. Lebeck * Based on “The.
March 18, 2008SSE Meeting 1 Mary Hall Dept. of Computer Science and Information Sciences Institute Multicore Chips and Parallel Programming.
Acceleration of the Smith– Waterman algorithm using single and multiple graphics processors Author : Ali Khajeh-Saeed, Stephen Poole, J. Blair Perot. Publisher:
Multiprocessors ELEC 6200: Computer Architecture and Design Instructor : Agrawal Name: Nam.
11/14/05ELEC Fall Multi-processor SoCs Yijing Chen.
ELEC Fall 05 1 Very- Long Instruction Word (VLIW) Computer Architecture Fan Wang Department of Electrical and Computer Engineering Auburn.
Page 1 CS Department Parallel Design of JPEG2000 Image Compression Xiuzhen Huang CS Department UC Santa Barbara April 30th, 2003.
State Machines Timing Computer Bus Computer Performance Instruction Set Architectures RISC / CISC Machines.
How Multi-threading can increase on-chip parallelism
Memory access scheduling Authers: Scott RixnerScott Rixner,William J. Dally,Ujval J. Kapasi, Peter Mattson, John D. OwensWilliam J. DallyUjval J. KapasiPeter.
Synergistic Processing In Cell’s Multicore Architecture Michael Gschwind, et al. Presented by: Jia Zou CS258 3/5/08.
Memory management Ingrid Verbauwhede Department of Electrical Engineering University of California Los Angeles.
Low power and cost effective VLSI design for an MP3 audio decoder using an optimized synthesis- subband approach T.-H. Tsai and Y.-C. Yang Department of.
0 HPEC 2010 Automated Software Cache Management.
To GPU Synchronize or Not GPU Synchronize? Wu-chun Feng and Shucai Xiao Department of Computer Science, Department of Electrical and Computer Engineering,
J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy IBM Systems and Technology Group IBM Journal of Research and Development.
Programming the Cell Multiprocessor Işıl ÖZ. Outline Cell processor – Objectives – Design and architecture Programming the cell – Programming models CellSs.
Massively LDPC Decoding on Multicore Architectures Present by : fakewen.
UNIX System Administration OS Kernal Copyright 2002, Dr. Ken Hoganson All rights reserved. OS Kernel Concept Kernel or MicroKernel Concept: An OS architecture-design.
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
Very Long Instruction Word (VLIW) Architecture. VLIW Machine It consists of many functional units connected to a large central register file Each functional.
Operating System Review September 10, 2012Introduction to Computer Security ©2004 Matt Bishop Slide #1-1.
Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang.
1 Hardware Support for Collective Memory Transfers in Stencil Computations George Michelogiannakis, John Shalf Computer Architecture Laboratory Lawrence.
Fall 2015, Aug 17 ELEC / Lecture 1 1 ELEC / Computer Architecture and Design Fall 2015 Introduction Vishwani D. Agrawal.
Comparing Memory Systems for Chip Multiprocessors Leverich et al. Computer Systems Laboratory at Stanford Presentation by Sarah Bird.
1 Presenter: Min Yu,Lo 2015/10/9 Lauri Matilainen, Erno Salminen, Timo D. Hamalainen, and Marko Hannikainen International Conference on Embedded.
AN EXTENDED OPENMP TARGETING ON THE HYBRID ARCHITECTURE OF SMP-CLUSTER Author : Y. Zhao 、 C. Hu 、 S. Wang 、 S. Zhang Source : Proceedings of the 2nd IASTED.
MS Thesis Defense “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA” by Deepthi Gummadi CoE EECS Department April 21, 2014.
1 Advance Computer Architecture CSE 8383 Ranya Alawadhi.
WMPI 2006, Austin, Texas © 2006 John C. Koob An Empirical Evaluation of Semiconductor File Memory as a Disk Cache John C. Koob Duncan G. Elliott Bruce.
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
Seunghwa Kang David A. Bader Optimizing Discrete Wavelet Transform on the Cell Broadband Engine.
Types of Operating Systems
What is cache memory?. Cache Cache is faster type of memory than is found in main memory. In other words, it takes less time to access something in cache.
Alternative ProcessorsHPC User Forum Panel1 HPC User Forum Alternative Processor Panel Results 2008.
A Single-Pass Cache Simulation Methodology for Two-level Unified Caches + Also affiliated with NSF Center for High-Performance Reconfigurable Computing.
Sep 08, 2009 SPEEDUP – Optimization and Porting of Path Integral MC Code to New Computing Architectures V. Slavnić, A. Balaž, D. Stojiljković, A. Belić,
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
Types of Operating Systems 1 Computer Engineering Department Distributed Systems Course Assoc. Prof. Dr. Ahmet Sayar Kocaeli University - Fall 2015.
Slide-1 Multicore Theory MIT Lincoln Laboratory Theory of Multicore Algorithms Jeremy Kepner and Nadya Bliss MIT Lincoln Laboratory HPEC 2008 This work.
ACCELERATING QUERY-BY-HUMMING ON GPU Pascal Ferraro, Pierre Hanna, Laurent Imbert, Thomas Izard ISMIR 2009 Presenter: Chung-Che Wang (Focus on the performance.
Comparison of Cell and POWER5 Architectures for a Flocking Algorithm A Performance and Usability Study CS267 Final Project Jonathan Ellithorpe Mark Howison.
Logical & Physical Address Nihal Güngör. Logical Address In simplest terms, an address generated by the CPU is known as a logical address. Logical addresses.
Aarul Jain CSE520, Advanced Computer Architecture Fall 2007.
1 Cache-Oblivious Query Processing Bingsheng He, Qiong Luo {saven, Department of Computer Science & Engineering Hong Kong University of.
CPE432: Computer Design Course Introduction Dr. Gheith Abandah د. غيث علي عبندة.
SIMD Implementation of Discrete Wavelet Transform Jake Adriaens Diana Palsetia.
Fast and parallel implementation of Image Processing Algorithm using CUDA Technology On GPU Hardware Neha Patil Badrinath Roysam Department of Electrical.
IBM Cell Processor Ryan Carlson, Yannick Lanner-Cusin, & Cyrus Stoller CS87: Parallel and Distributed Computing.
CSCI2510 Tutorial 5 Introduction to Cache Zong Wen
Exploiting Graphics Processors for High-performance IP Lookup in Software Routers Jin Zhao, Xinya Zhang, Xin Wang, Yangdong Deng, Xiaoming Fu IEEE INFOCOM.
Compilers: History and Context COMP Outline Compilers and languages Compilers and architectures – parallelism – memory hierarchies Other uses.
EECE571R -- Harnessing Massively Parallel Processors ece
Distributed Shared Memory
Konstantis Daloukas Nikolaos Bellas Christos D. Antonopoulos
Lecture 20: Scaling Inference with RDBMS
EE 4xx: Computer Architecture and Performance Programming
CS 286 Computer Organization and Architecture
Week1 software - Lecture outline & Assignments
Presentation transcript:

Programming Multiprocessors with Explicitly Managed Memory Hierarchies ELEC 6200 Xin Jin 4/30/2010

Outline Introduction - Hardware-managed caches - Software-managed caches, explicitly managed memory (EMM) - Cell Processor Programming models developed for the Cell Processor Performance comparison Conclusion

Hardware and Software managed caches Ref: Software-Managed Caches. Bruce Jacob, Electrical Engineering Department, University of Maryland

Managing the memory hierarchy in multicore processors Introduces trade-offs in terms of performance, code complexity and optimization effort. Based on hardware-managed caches - Single shared address space - Consistent view of shared memory Based on software-managed local memories - Disjoint address spaces - Need to keep them consistent

Architecture of Cell Processor Challenges: Local memory spaces Small local storage Data alignment

Application: Fixedgrid Fixedgrid is an atmospheric modeling application. It has two types of computational kernels 1. row discretization 2. column discretization

Programming Models (a) Cellgen Code example (b) Sequoia Code example

Performance SDK3 version does not require column data to be reordered on the PPE or SPE Unlike SDK3, the Cellgen and Sequoia versions do not support this kind of access, and instead the non-contiguous data is rearranged on the PPE

Conclusion The implicit management of parallelism and locality can produce code with performance comparable to that of code generated from explicit management of locality. Hand-tuning of data transfers is still necessary for optimization.

Reference "Programming Multiprocessors with Explicitly Managed Memory Hierarchies," S. Schneider, et al., Computer, vol. 42, no. 12, pp , Dec 2009.