Implementing Advanced Intelligent Memory


Implementing Advanced Intelligent Memory Josep Torrellas, U of Illinois & IBM Watson Ctr. David Padua and Dan Reed, U of Illinois torrella@watson.ibm.com, padua@cs.uiuc.edu, reed@cs.uiuc.edu September 1998

Technological Opportunity
We can fabricate a large silicon area of Merged Logic and DRAM (MLD)
Question: how do we best exploit this capability to advance computing?

Pieces of the Puzzle
Today: 256 Mbit MLD process at 0.25 um, including logic running at 200 MHz
E.g. 2 IBM PowerPC 603 cores with 8 KB I+D caches take 10% of the chip
Manufacturers: IBM CMOS-7LD technology available Fall 98; Japanese manufacturers (NEC, Fujitsu) are in the lead
In a couple of years: 512 Mbit MLD process at 0.18 um

Key Applications Clamor for HW
Data Mining (decision trees and neural networks)
Computational Biology (DNA sequence matching)
Financial Modeling (stock options, derivatives)
Molecular Dynamics (short-range forces)
Plus the typical ones: MPEG, TPC-D, speech recognition
All are data-intensive applications

Our Solution: Principles
1. Extract high bandwidth from DRAM:
> Many simple processing units
2. Run legacy codes with high performance:
> Do not replace the off-the-shelf microprocessor in the workstation
> Take the place of a memory chip, with the same interface as DRAM
> Intelligent memory defaults to plain DRAM
3. Small increase in cost over DRAM:
> Simple processing units, still dense
4. General purpose:
> Do not hardwire any algorithm; nothing special-purpose

Architecture Proposed
[Block diagram: the host processor (P.Host) with L1/L2 caches connects over the memory network to plain DRAM and to FlexRAM chips; each FlexRAM chip holds a memory processor (P.Mem) with its cache and an array of P.Arrays embedded in DRAM]

Proposed Work
Design an architecture based on key IBM applications
Fabricate chips using the IBM CMOS-7LD technology
Build a workstation with an intelligent memory system
Build a language and compiler for the intelligent memory
Demonstrate significant speedups on the applications

Example App: DNA Matching
BLAST code from the NIH web site
Problem: find areas of the database DNA chains that match (modulo some mutations) the sample DNA chain

How the Algorithm Works
1. Pick 4 consecutive amino acids from the sample (e.g. bbcf)
2. Generate the 50+ most-likely mutations (e.g. becf)

Example App: DNA Matching
3. Compare them to every position in the database DNA chains
4. If a match is found, try to extend it
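The four steps above can be sketched in plain Python. This is not the FlexRAM implementation: the seed length of 4 and the seed/extend structure come from the slides, while the alphabet, the single-substitution mutation rule (standing in for the "50+ most-likely mutations" table), and the greedy extension rule are illustrative assumptions.

```python
# Illustrative seed-and-extend matcher in the spirit of BLAST.

def seeds(sample, k=4):
    """Step 1: every run of k consecutive symbols in the sample."""
    return {sample[i:i + k] for i in range(len(sample) - k + 1)}

def mutations(seed, alphabet="abcdef"):
    """Step 2: single-symbol substitutions, a stand-in for the
    real algorithm's table of most-likely mutations."""
    out = {seed}
    for i in range(len(seed)):
        for c in alphabet:
            out.add(seed[:i] + c + seed[i + 1:])
    return out

def find_matches(sample, database, k=4):
    """Steps 3-4: scan every database position for a seed hit,
    then greedily extend the match to the right."""
    targets = set()
    for s in seeds(sample, k):
        targets |= mutations(s)
    hits = []
    for pos in range(len(database) - k + 1):
        if database[pos:pos + k] in targets:
            end = pos + k
            # extend while the database keeps using the sample's symbols
            while end < len(database) and database[end] in sample:
                end += 1
            hits.append((pos, database[pos:end]))
    return hits

print(find_matches("bbcf", "xxbecfzz"))  # -> [(2, 'becf')]
```

Here the seed bbcf does not occur verbatim in the database, but its one-symbol mutation becf does, so the scan reports a hit at position 2, exactly the situation the slide depicts.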

P.Arrays
Total of 64 per chip (90 mm²)
SPMD engines, not SIMD; cycling at 200 MHz
32-bit datapath, integer only, including multiply; 28 instructions
Organized as a ring, no need for a mesh
Each P.Array has 1 MB of DRAM memory, and can also access the memory of its N and S neighbors
2 1-KB row buffers to capture data locality
8 KB of SRAM instruction memory shared by 4 P.Arrays
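A minimal model of the ring restriction described above: 64 P.Arrays, 1 MB of DRAM each, with each unit able to touch only its own bank and the banks of its two ring neighbors. The function name and the flat global address space are invented for this sketch.

```python
N_PARRAYS = 64
BANK_SIZE = 1 << 20  # 1 MB of DRAM per P.Array

def can_access(parray_id, global_addr):
    """True iff the bank owning `global_addr` is this P.Array's own
    bank or one of its north/south neighbors on the ring."""
    owner = global_addr // BANK_SIZE
    dist = (parray_id - owner) % N_PARRAYS
    return dist in (0, 1, N_PARRAYS - 1)

# P.Array 0 can reach bank 63 (ring wrap-around) but not bank 2
print(can_access(0, 63 * BANK_SIZE))  # True
print(can_access(0, 2 * BANK_SIZE))   # False
```

The modular distance is what makes this a ring rather than a mesh: neighbor checks wrap around, and no routing beyond one hop is ever needed.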

P.Array Design
[Block diagram: controller, instruction memory, and broadcast bus feed a datapath of ALU, input register, result register, address generator, and switches; three ports (0-2) reach the DRAM block through the sense amps/column decoder and the row decoder]

P.Mem
IBM PowerPC 603 with 8 KB D + 8 KB I caches
About 15 mm², 200 MHz
Also included: memory interface

DRAM Memory
512 Mbit (64 MB) at 0.18 um
Organized as 64 banks of 1 MB each (one per P.Array)
2.2 V operating voltage
Internal memory bandwidth: 102 GB/s at 200 MHz
Memory access time at 200 MHz: 2 cycles for a row-buffer hit, 4 cycles for a miss
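The quoted internal bandwidth is consistent with each of the 64 banks delivering 8 bytes per cycle at 200 MHz; the per-bank width of 8 bytes is an assumption inferred from the totals on the slide, not a figure stated there.

```python
banks = 64
bytes_per_bank_per_cycle = 8      # assumed: one 64-bit word per bank
clock_hz = 200e6                  # 200 MHz

bandwidth = banks * bytes_per_bank_per_cycle * clock_hz
print(bandwidth / 1e9)  # 102.4 GB/s, matching the ~102 GB/s on the slide
```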

Chip Architecture
[Block diagram: the basic building block pairs 4 P.Arrays with 4 MB of DRAM (1 MB blocks of 512 rows x 4k columns each, reached through memory control blocks), an 8 KB 4-port SRAM instruction memory, and 1 multiplier; the blocks are replicated across the chip and tied to the P.Mem by a broadcast bus]

Language & Compiler
High-level C-like explicitly parallel language that exposes the architecture
Compiler that automatically translates it into structured assembly
Libraries of Intelligent Memory Operations (IMOs) written in assembly

Intelligent Memory Ops
General-purpose operations such as:
Arithmetic/logic/symbolic array operations
Set operations; iterators over the elements of a set
Regular/irregular structure search and update (CAM operations)
Domain-specific operations: e.g. FFT
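To give the flavor of such an operation, a CAM-style search-and-update over the per-P.Array memory banks might look like the following if written as a plain library call. The function, its signature, and the record layout are invented for this sketch; nothing here is the actual IMO library.

```python
def cam_search_update(banks, key, new_value):
    """Search every bank for records whose key matches `key` and
    update them in place; conceptually each bank is scanned in
    parallel by its own P.Array. Returns the number of hits."""
    hits = 0
    for bank in banks:                 # one bank per P.Array
        for record in bank:
            if record["key"] == key:   # associative (CAM-style) match
                record["value"] = new_value
                hits += 1
    return hits

# Two small lists standing in for two P.Array memory banks
banks = [
    [{"key": "a", "value": 1}, {"key": "b", "value": 2}],
    [{"key": "a", "value": 3}],
]
print(cam_search_update(banks, "a", 0))  # -> 2 records updated
```

The point of running such an operation in the memory chip is that each bank is searched next to its own sense amplifiers, so the full internal bandwidth is used and only the hit count crosses the chip boundary.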

Performance Evaluation
Hardware performance monitoring embedded in the chip
Software tools to extract and interpret the performance information

Preliminary Results
[Bar chart: relative execution time of MPEG2 and Chroma/Keying on a uniprocessor (normalized to 1) versus FlexRAM]

Current Status
Done: identified and wrote all applications; designed the architecture based on the apps and IBM technology; conceived the ideas behind the language/compiler
Need to do: chip layout and fabrication; development of the compiler
Funds needed for: processor core (P.Mem), chip fabrication, hardware and software engineers

Conclusion
We have a handle on:
A promising technology (MLD)
Key applications of industrial interest
A real chance to transform the computing landscape

Current Research Work
Josep Torrellas, U of Illinois & IBM Watson Ctr.
torrella@cs.uiuc.edu, http://iacoma.cs.uiuc.edu
September 1998

Current Research Projects
1. Illinois Aggressive COMA (I-ACOMA): scalable NUMA and COMA architectures
2. FlexRAM: Advanced Intelligent Memory
3. Speculative Parallelization Hardware
4. Database Workload Characterization: TPC-C, TPC-D, data mining
> All projects are in collaboration with IBM Watson
> Project 4 is also in collaboration with Intel Oregon

Publications 1997 and 98
1. Architectural Advances in DSMs: A Possible Road Ahead, Josep Torrellas. Ninth SIAM Conference on Parallel Processing for Scientific Computing, Spring 1999.
2. A Direct-Execution Framework for Fast and Accurate Simulation of Superscalar Processors, Venkata Krishnan and Josep Torrellas. International Conference on Parallel Architectures and Compilation Techniques (PACT), October 1998.
3. Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor, Venkata Krishnan and Josep Torrellas. International Conference on Supercomputing (ICS), July 1998.
4. Comparing Data Forwarding and Prefetching for Communication-Induced Misses in Shared-Memory MPs, David Koufaty and Josep Torrellas. International Conference on Supercomputing (ICS), July 1998.
5. Cache-Only Memory Architectures, Fredrik Dahlgren and Josep Torrellas. IEEE Computer Magazine, to appear 1998.
6. Executing Sequential Binaries on a Multithreaded Architecture with Speculation Support, Venkata Krishnan and Josep Torrellas. Workshop on Multi-Threaded Execution, Architecture and Compilation (MTEAC'98), January 1998.
7. A Clustered Approach to Multithreaded Processors, Venkata Krishnan and Josep Torrellas. International Parallel Processing Symposium, March 1998.
8. Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors, Ye Zhang, Lawrence Rauchwerger, and Josep Torrellas. Fourth International Symposium on High-Performance Computer Architecture, February 1998.
9. Enhancing Memory Use in Simple COMA: Multiplexed Simple COMA, Sujoy Basu and Josep Torrellas. Fourth International Symposium on High-Performance Computer Architecture, February 1998.
10. How Processor-Memory Integration Affects the Design of DSMs, Liuxi Yang, Anthony-Trung Nguyen, and Josep Torrellas. Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997.
11. Efficient Use of Processing Transistors for Larger On-Chip Storage: Multithreading, Venkata Krishnan and Josep Torrellas. Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997.
12. The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors, Pedro Trancoso, Josep-L. Larriba-Pey, Zheng Zhang, and Josep Torrellas. Third International Symposium on High-Performance Computer Architecture, January 1997.
13. Reducing Remote Conflict Misses: NUMA with Remote Cache versus COMA, Zheng Zhang and Josep Torrellas. Third International Symposium on High-Performance Computer Architecture, January 1997.
14. Speeding up the Memory Hierarchy in Flat COMA Multiprocessors, Liuxi Yang and Josep Torrellas. Third International Symposium on High-Performance Computer Architecture, January 1997.