The Scale Vector-Thread Processor

Presentation transcript:

The Scale Vector-Thread Processor
Ronny Krashinsky, Christopher Batten, Krste Asanović

Embedded computing today: Sensor Nets, Servers, Robots, Laptops, Routers, Set-top Boxes, Games, TVs, Smart phones, Automobiles.

Modern embedded systems
[Diagram: CPU, Chip Set, DSP1, DSP2, FPGA, ASIC, DRAM, SRAM, DRAM]
Multiple programming languages and models
Multiple distinct memories
Multiple communication and synchronization models
Inflexible, Inefficient, Expensive

All-purpose programmable core
Handles all information processing
Unified software programming model
Competitive in performance and energy
Scale by tiling an efficient core

Vector-Thread Architecture

VT unifies the vector and multithreaded compute models.
A control processor interacts with a vector of virtual processors (VPs).
Vector-fetch: the control processor fetches instruction blocks for all VPs in parallel.
Thread-fetch: a VP fetches its own instruction blocks.
VT allows a seamless intermixing of vector and thread control.

[Diagram: Control Processor issuing vector-fetch to VP0, VP1, VP2, VP3 ... VPN, each of which can thread-fetch; all connect to Memory]
[Diagram: Vector Execution. Control Proc. issues repeated vector-load, vector-fetch, vector-store sequences across VP0, VP1, VP2, VP3 ... VPN]
[Diagram: Threaded Execution. Control Proc. and VP0, VP1, VP2, VP3 ... VPN]
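The difference between vector-fetch and thread-fetch can be sketched in a few lines of Python. This is only an illustrative software model, not the Scale hardware or its instruction set: the names `VP`, `vector_fetch`, `vector_load`, `vector_store`, and the Collatz step-counting workload are all assumptions made for the example. Real VPs execute in parallel, whereas this sketch visits them one after another.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class VP:
    """One virtual processor: an index plus a private register set (hypothetical model)."""
    index: int
    regs: Dict[str, int] = field(default_factory=dict)


# A block runs on one VP and returns the next block that VP thread-fetches,
# or None when the VP has nothing more to fetch.
Block = Callable[[VP], Optional[Callable]]


def vector_load(vps: List[VP], reg: str, memory: List[int], base: int = 0) -> None:
    """Control processor command: VP i loads memory[base + i] into reg."""
    for vp in vps:
        vp.regs[reg] = memory[base + vp.index]


def vector_store(vps: List[VP], reg: str, memory: List[int], base: int = 0) -> None:
    """Control processor command: VP i stores reg to memory[base + i]."""
    for vp in vps:
        memory[base + vp.index] = vp.regs[reg]


def vector_fetch(vps: List[VP], block: Block) -> None:
    """Control processor sends the same block to every VP.

    Each VP then follows its own chain of thread-fetches until it stops.
    Hardware would run the VPs in parallel; this loop is sequential.
    """
    for vp in vps:
        next_block = block(vp)
        while next_block is not None:
            next_block = next_block(vp)


def collatz_step(vp: VP) -> Optional[Callable]:
    """Thread-fetched block: one data-dependent loop iteration."""
    x = vp.regs["x"]
    if x == 1:
        return None  # this VP is done; others may keep going
    vp.regs["x"] = x // 2 if x % 2 == 0 else 3 * x + 1
    vp.regs["steps"] += 1
    return collatz_step  # thread-fetch the same block again


def collatz_start(vp: VP) -> Optional[Callable]:
    """Vector-fetched block: identical setup on every VP."""
    vp.regs["steps"] = 0
    return collatz_step


def run_demo() -> List[int]:
    memory = [1, 2, 3, 6, 0, 0, 0, 0]
    vps = [VP(i) for i in range(4)]
    vector_load(vps, "x", memory, base=0)
    vector_fetch(vps, collatz_start)
    vector_store(vps, "steps", memory, base=4)
    return memory[4:]
```

Calling `run_demo()` loads four inputs with a vector-load, starts every VP with one vector-fetch, lets each VP loop for a different number of iterations via thread-fetches, and writes the results back with a vector-store. The loop trip count differs per VP, which is the kind of control flow a pure vector model handles poorly.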

Control Processor (CP): scalar RISC core
Vector-Thread Unit: 4 lanes, 16 decoupled clusters, instruction fetch, load/store, and command management units, up to 128 VP threads
Vector-Memory Unit: unit-stride, strided, and segment loads and stores, refill/access decoupling
Cache: 4-port, non-blocking, 32-way set-associative, 32 KB

[Cluster layout labels: Register File 32x32-bit, Instr. cache 32x46-bit, Datapath 32-bit, Control Logic]

Automatic synthesis, place & route
Preplaced standard cells, RAM blocks
Aggressive clock-gating
Iterative design flow
Verification: formal equiv. check + sim.

Vectorizable data processing applications, e.g. a wireless transmitter: 9.7 ops per cycle
Non-vectorizable encoder/decoder algorithms, e.g. ADPCM speech decompression: 6.5 ops per cycle
Threaded IP routing table lookups: 6.1 ops per cycle

TSMC 180 nm, 6 layers Al
7.1 M trans., 1.4 M gates, 397 K cells, 300 k RAM bits
16.6 mm^2 core area, 23.1 mm^2 chip area
260 MHz at 1.8 V, 600 mW typical
24 person-months design effort

[Die photo labels: Lane 0, Lane 1, Lane 2, Lane 3, each with clusters C0, C1, C2, C3 plus CMU and SD; Vector-Mem Unit; 32 KB SRAM; Cache Tags; CP; Cache Control; Read/Write Crossbars; scale bar 3mm]
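The ops-per-cycle figures and the 260 MHz clock quoted above can be combined into rough throughput numbers. The short Python sketch below does that arithmetic. Two assumptions are mine, not the slides': that the ops-per-cycle figures hold at the 260 MHz clock, and that the 600 mW "typical" power applies to each workload. The slides do not report per-benchmark power, so the energy-per-operation values are indicative only.

```python
CLOCK_HZ = 260e6        # 260 MHz at 1.8 V, from the chip statistics
TYPICAL_POWER_W = 0.6   # 600 mW typical; assumed here to apply to every workload

# Ops per cycle as quoted for each class of workload.
OPS_PER_CYCLE = {
    "wireless transmitter (vectorizable)": 9.7,
    "ADPCM speech decompression (non-vectorizable)": 6.5,
    "IP routing table lookups (threaded)": 6.1,
}


def giga_ops_per_second(ops_per_cycle: float, clock_hz: float = CLOCK_HZ) -> float:
    """Throughput in billions of operations per second."""
    return ops_per_cycle * clock_hz / 1e9


def picojoules_per_op(
    ops_per_cycle: float,
    power_w: float = TYPICAL_POWER_W,
    clock_hz: float = CLOCK_HZ,
) -> float:
    """Energy per operation in picojoules, under the power assumption above."""
    return power_w / (ops_per_cycle * clock_hz) * 1e12


def summarize() -> dict:
    """Map each workload to (Gops/s, pJ per op)."""
    return {
        name: (giga_ops_per_second(opc), picojoules_per_op(opc))
        for name, opc in OPS_PER_CYCLE.items()
    }


if __name__ == "__main__":
    for name, (gops, pj) in summarize().items():
        print(f"{name}: {gops:.3f} Gops/s, about {pj:.0f} pJ/op")
```

Under these assumptions the vectorizable workload comes out at roughly 2.5 billion operations per second, and the threaded and non-vectorizable workloads at roughly 1.6 to 1.7 billion.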