1 Digital Space Anant Agarwal MIT and Tilera Corporation.

Presentation transcript:

1 Digital Space Anant Agarwal MIT and Tilera Corporation

2 Arecibo

3 Stages of Reality [Figure: CPU and Mem blocks beside a chip drawn as a grid of tiles, each labeled p, m, s; captions: 1996, 2007, "B transistors"]

4 Virtual reality Simulator reality Prototype reality Product reality

5 The Opportunity 20MIPS cpu in … Few thousand gates

6 The Opportunity The billion transistor chip of 2007

7 How to Fritter Away Opportunity the x1786? does not scale 100 ported RegFil and RR Caches Control More resolution buffers, control “1/10 ns”

8 Take Inspiration from ASICs. Lots of ALUs, lots of registers, lots of local memories – huge on-chip parallelism – but with a slower clock. Custom-routed, short wires optimized for specific applications. Fast, low power, area efficient. But not programmable. [Figure label: mem]

9 Our Early Raw Proposal. Got parallelism? E.g., 100-way unrolled loop, running on 100 ALUs, 1000 regs, 100 memory banks. But how to build programmable, yet custom, wires? [Figure labels: CPU, Mem]

10 Replace custom wires with routed on-chip networks. A digital wire. Pipeline it! Fast clock (10GHz in 2010). Uh! What were we smoking! Multiplex it! Improve utilization. Software orchestrate it! Customize to application and maximize utilization. A dynamic router! [Figure label: Ctrl]

11 Static Router. Application → Compiler → Switch Code (one per switch). A static router!

12 Replace Wires with Routed Networks Ctrl

13 50-Ported Register File → Distributed Registers. Gigantic 50 ported register file

14 50-Ported Register File → Distributed Registers. Gigantic 50 ported register file

15 Distributed Registers + Routed Network Distributed register file R Called NURA [ASPLOS 1998]

16 16-Way ALU Clump → Distributed ALUs. ALU, Bypass Net, RF

17 Distributed ALUs, Routed Bypass Network ALU R Scalar Operand Network (SON) [TPDS 2005]

18 Mongo Cache → Distributed Cache. Gigantic 10 ported cache

19 Distributing the Cache

20 Distributed Shared Cache [ISCA 1999]. Like DSM (distributed shared memory), cache is distributed; but, unlike NUCA, caches are local to processors, not far away. [Figure labels: ALU, R, $]

21 Tiled Multicore Architecture ALU R $

22 E.g., Operand Routing in 16-way Superscalar ALU Bypass Net RF >> + Source: [Taylor ISCA 2004]

23 Operand Routing in a Tiled Architecture >> ALU >> + R $

24 Tiled Multicore. Scales to large numbers of cores. Modular – design, layout and verify 1 tile. Power efficient [ MIT-CSAIL-TR ] – Short wires: CV²f – Chandrakasan effect: CV²f – Dynamic and compiler scheduled routing. Tile = Processor Core + Switch (S)
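The power bullets on this slide refer to the standard dynamic switching power relation, P = C·V²·f. The sketch below shows why lowering capacitance (shorter wires) gives a linear saving and lowering voltage a quadratic one. Reading the "Chandrakasan effect" as a voltage reduction is an interpretation, and all numeric values are illustrative assumptions, not figures from Raw or Tilera.

```python
def dynamic_power(c_farads: float, v_volts: float, f_hertz: float) -> float:
    """Dynamic switching power P = C * V^2 * f (activity factor folded into C)."""
    return c_farads * v_volts ** 2 * f_hertz


# Illustrative values only (assumptions, not measurements from the talk).
base = dynamic_power(1e-9, 1.0, 1e9)            # reference design
short_wires = dynamic_power(0.5e-9, 1.0, 1e9)   # halve C, e.g. shorter wires
low_voltage = dynamic_power(1e-9, 0.5, 1e9)     # halve V: quadratic saving

print(f"short wires: {short_wires / base:.2f}x of base power")
print(f"low voltage: {low_voltage / base:.2f}x of base power")
```

Halving C halves the power, while halving V cuts it to a quarter, which is why voltage is the more attractive knob when parallelism can make up for a slower clock.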

25 A Prototype Tiled Architecture: The Raw Microprocessor The Raw Chip Tile Disk stream Video1 DRAM Packet stream A Raw Tile SMEM SWITCH PC DMEM IMEM REG PC FPU ALU Raw Switch PC SMEM [Billion transistor IEEE Computer Issue ’97] Scalar operand network (SON): Capable of low latency transport of small (or large) packets [IEEE TPDS 2005]

26 Virtual reality Simulator reality Prototype reality Product reality

27 Scalar Operand Transport in Raw. Goal: flow controlled, in order delivery of operands. Sending tile: fmul r24, r3, r4; software controlled crossbar: route P->E, N->S. Receiving tile: software controlled crossbar: route W->P, S->N; fadd r5, r3, r24
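The operand path on this slide (processor to switch, switch to neighboring switch, switch to processor) can be modeled as a chain of small flow-controlled FIFOs. The sketch below is a toy model under stated assumptions: the `Link` class, the one-entry buffers, and the `transport` function are hypothetical names, and it makes no claim about Raw's actual cycle counts.

```python
from collections import deque


class Link:
    """A flow-controlled FIFO between two ports; FIFO order keeps operands in order."""

    def __init__(self, capacity: int = 1):
        self.queue = deque()
        self.capacity = capacity

    def has_room(self) -> bool:
        return len(self.queue) < self.capacity

    def has_data(self) -> bool:
        return bool(self.queue)

    def send(self, word):
        assert self.has_room(), "sender must stall when the link is full"
        self.queue.append(word)

    def recv(self):
        return self.queue.popleft()


def transport(operands):
    """Move operands from the sending tile's processor to the receiving tile's processor."""
    pending = deque(operands)
    proc_to_switch = Link()    # sending tile: write to the network register
    east_to_west = Link()      # wire between the two switches
    switch_to_proc = Link()    # receiving tile: read of the network register
    received = []
    while len(received) < len(operands):
        # Consumer side first, so each word advances at most one hop per step.
        if switch_to_proc.has_data():
            received.append(switch_to_proc.recv())
        if east_to_west.has_data() and switch_to_proc.has_room():
            switch_to_proc.send(east_to_west.recv())     # route W->P
        if proc_to_switch.has_data() and east_to_west.has_room():
            east_to_west.send(proc_to_switch.recv())     # route P->E
        if pending and proc_to_switch.has_room():
            proc_to_switch.send(pending.popleft())
    return received


r3, r4 = 2.0, 3.0                  # assumed register contents, same on both tiles
(r24,) = transport([r3 * r4])      # sending tile: fmul r24, r3, r4
r5 = r3 + r24                      # receiving tile: fadd r5, r3, r24
print(r5)
```

Because every stage only forwards when the next stage has room, a full buffer stalls the sender instead of dropping or reordering operands, which is the property the slide states as its goal.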

28 RawCC: Distributed ILP Compilation (DILP) [ASPLOS 1998]. C → Partitioning → Place, Route, Schedule. Black arrows = Operand Communication over SON.
Source: tmp0 = (seed*3+2)/2; tmp1 = seed*v1+2; tmp2 = seed*v2 + 2; tmp3 = (seed*6+2)/3; v2 = (tmp1 - tmp3)*5; v1 = (tmp1 + tmp2)*3; v0 = tmp0 - v1; v3 = tmp3 - v2
Renamed three-operand form: pval5=seed.0*6.0 pval4=pval5+2.0 tmp3.6=pval4/3.0 tmp3=tmp3.6 v3.10=tmp3.6-v2.7 v3=v3.10 v2.4=v2 pval3=seed.0*v2.4 tmp2.5=pval3+2.0 tmp2=tmp2.5 pval6=tmp1.3-tmp2.5 v2.7=pval6*5.0 v2=v2.7 seed.0=seed pval1=seed.0*3.0 pval0=pval1+2.0 tmp0.1=pval0/2.0 tmp0=tmp0.1 v1.2=v1 pval2=seed.0*v1.2 tmp1.3=pval2+2.0 tmp1=tmp1.3 pval7=tmp1.3+tmp2.5 v1.8=pval7*3.0 v1=v1.8 v0.9=tmp0.1-v1.8 v0=v0.9
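A partitioner like the one named on this slide works from the dependence graph of the renamed instructions. The sketch below builds that graph for the slide's instruction list and reports the critical path length and the resulting average parallelism. It is not RawCC's algorithm. It assumes unit latency per instruction, and it assumes that names carrying a version suffix or a `pval` prefix are compiler temporaries while reads of plain names (`seed`, `v1`, `v2`) are live-in values.

```python
import re
from functools import lru_cache

# Renamed three-operand instructions as listed on the slide (one copy).
CODE = """
pval5=seed.0*6.0 pval4=pval5+2.0 tmp3.6=pval4/3.0 tmp3=tmp3.6
v3.10=tmp3.6-v2.7 v3=v3.10 v2.4=v2 pval3=seed.0*v2.4 tmp2.5=pval3+2.0
tmp2=tmp2.5 pval6=tmp1.3-tmp2.5 v2.7=pval6*5.0 v2=v2.7 seed.0=seed
pval1=seed.0*3.0 pval0=pval1+2.0 tmp0.1=pval0/2.0 tmp0=tmp0.1 v1.2=v1
pval2=seed.0*v1.2 tmp1.3=pval2+2.0 tmp1=tmp1.3 pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0 v1=v1.8 v0.9=tmp0.1-v1.8 v0=v0.9
""".split()

# An identifier, optionally followed by a ".N" version suffix; skips literals like 6.0.
NAME = re.compile(r"[A-Za-z_]\w*(?:\.\d+)?")

# Map each destination to the names its expression reads.
defs = {}
for instruction in CODE:
    dst, expr = instruction.split("=")
    defs[dst] = NAME.findall(expr)


def is_temp(name: str) -> bool:
    """Assumption: versioned names and pval* are temporaries defined in this block."""
    return "." in name or name.startswith("pval")


@lru_cache(maxsize=None)
def depth(dst: str) -> int:
    """Length of the longest dependence chain ending at dst (unit latency)."""
    producers = [s for s in defs[dst] if is_temp(s) and s in defs]
    return 1 + max((depth(s) for s in producers), default=0)


critical_path = max(depth(d) for d in defs)
average_parallelism = len(defs) / critical_path
print(len(defs), critical_path, round(average_parallelism, 2))
```

The ratio of instruction count to critical path length bounds how much speedup any placement onto tiles could extract from this block, before communication latency is taken into account.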

29 Virtual reality Simulator reality Prototype reality Product reality

30 A Tiled Processor Architecture Prototype: the Raw Microprocessor. October 02. Michael Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Rajeev Barua, Elliot Waingold, Jonathan Babb, Sri Devabhaktuni, Saman Amarasinghe, Anant Agarwal

31 Raw Die Photo. IBM 0.18 micron process, 16 tiles, 425MHz, 18 Watts (vpenta) [ISCA 2004]

32 Raw Motherboard

33 Raw Ideas and Decisions: What Worked, What Did Not. Build a complete prototype system; Simple processor with single issue cores; FPGA logic block in each tile; Distributed ILP and static network; Static network for streaming; Multiple types of computation – ILP, streams, TLP, server; PC in every tile

34 Why Build? Compiler (Amarasinghe), OS and runtimes (ISI), apps (ISI, Lincoln Labs, Durand) folks will not work with you unless you are serious about building hardware. Need motivation to build software tools (compilers, runtimes, debugging, visualization) – many challenges here. Run large data sets (simulation takes forever even with 100 servers!). Many hard problems show up or are better understood after you begin building (how to maintain ordering for distributed ILP, slack for streaming codes). Have to solve hard problems – no magic! The more radical the idea, the more important it is to build – World will only trust end-to-end results since it is too hard to dive into details and understand all assumptions – Would you believe this: “Prof. John Bull has demonstrated a simulation prototype of a 64-way issue out-of-order superscalar”? Cycle simulator became cycle accurate simulator only after HW got precisely defined. Don’t bother to commercialize unless you have a working prototype. Total network power few percent for real apps [Aug 2003 ISLPED, Kim et al. Energy characterization of a tiled architecture processor with on-chip networks] [MIT- CSAIL - TR Energy scalability of on-chip interconnection networks in multicore architectures ] – Network power is few percent in Raw for real apps; it reaches 36% only for a highly contrived synthetic sequence meant to toggle every network wire

35 Raw Ideas and Decisions: What Worked, What Did Not. Build a complete prototype system; Simple processor, single issue; FPGA logic block in each tile; Distributed ILP; Static network for streaming; Multiple types of computation – ILP, streams, TLP, server; PC in every tile. Verdicts as shown on the slide: Yes 1GHz, 2-way, inorder in 2016 No Yes ‘02, No ‘06, Yes ‘14 Yes

36 Raw Ideas and Decisions: Streaming – Interconnect Support. Forced synchronization in static network. [Figure: two software controlled crossbars; route P->E, N->S and route W->P, S->N]

37 Streaming in Tilera’s Tile Processor. Streaming done over dynamic interconnect with stream demuxing (AsTrO SDS). Automatic demultiplexing of streams into registers. Number of streams is virtualized. [Figure: add r55, r3, r4 feeds a Dynamic Switch; a second Dynamic Switch delivers through TAG matching, with a Catch all, to sub r5, r3, r55]
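Tag-based stream demultiplexing with a catch-all can be sketched as a small dictionary of queues. This is a toy model under stated assumptions: `StreamDemux`, `bind`, `deliver`, and `read` are hypothetical names rather than Tilera's API, and the number of hardware queues is an arbitrary parameter.

```python
from collections import deque


class StreamDemux:
    """Sorts incoming (tag, word) pairs into per-tag queues, with a catch-all."""

    def __init__(self, num_hw_queues: int):
        self.num_hw_queues = num_hw_queues
        self.queues = {}            # tag -> queue backing a register-mapped port
        self.catch_all = deque()    # (tag, word) pairs whose tag has no queue

    def bind(self, tag: int) -> None:
        """Dedicate one of the limited hardware queues to a stream tag."""
        if len(self.queues) >= self.num_hw_queues:
            raise RuntimeError("no free hardware queue; use the catch-all")
        self.queues[tag] = deque()

    def deliver(self, tag: int, word) -> None:
        """Called as a word arrives from the dynamic network."""
        queue = self.queues.get(tag)
        if queue is None:
            self.catch_all.append((tag, word))   # left for software to sort out
        else:
            queue.append(word)

    def read(self, tag: int):
        """Models reading the register mapped to this stream."""
        return self.queues[tag].popleft()


demux = StreamDemux(num_hw_queues=2)
demux.bind(55)
demux.deliver(55, 10)      # word on a bound stream
demux.deliver(99, 20)      # word on an unbound stream
print(demux.read(55), list(demux.catch_all))
```

The catch-all is what lets the number of streams exceed the number of hardware queues: software can drain it and sort words by tag, at lower speed than the register-mapped path.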

38 Virtual reality Simulator reality Prototype reality Product reality

39 Why Do We Care? Markets Demanding More Performance … and with power efficiency and programming ease. Wireless Networks (GGSN, Base Station): Demand for high throughput – more channels; Fast moving standards LTE, services. Networking market (Switches, Security Appliances, Routers): Demand for high performance – 10Gbps; Demand for more services, intelligence. Digital Multimedia market (Cable & Broadcast, Video Conferencing): Demand for high performance – H.264 HD; Demand for more services – VoD, transcode

40 Tilera’s TILEPro64™ Processor (Product reality) [Tile64, Hotchips 2007] [Tile64, Microprocessor Report Nov 2007]
Multicore Performance (90nm): Number of tiles: 64; Cache-coherent distributed cache: 5 MB; 750MHz (32, 16, 8 bit) BOPS; Bisection bandwidth: 2 Terabits per second
Power Efficiency: Power per tile (depending on app): 170 – 300 mW; Core power for h.264 encode (64 tiles): 12W; Clock speed: Up to 866 MHz
I/O and Memory Bandwidth: I/O bandwidth: 40 Gbps; Main Memory bandwidth: 200 Gbps
Programming: ANSI standard C; SMP Linux programming; Stream programming
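The power figures on this slide can be cross-checked with simple arithmetic. The sketch below uses only numbers stated on the slide (64 tiles, 170 to 300 mW per tile, 12 W core power for h.264 encode). The variable names are illustrative.

```python
TILES = 64
PER_TILE_MW = (170.0, 300.0)   # power per tile, depending on app
H264_CORE_W = 12.0             # core power for h.264 encode on 64 tiles

# Average per-tile power implied by the h.264 figure.
h264_per_tile_mw = H264_CORE_W * 1000 / TILES

# Core power range implied by the per-tile range across all tiles.
core_range_w = tuple(TILES * mw / 1000 for mw in PER_TILE_MW)

print(f"h.264 per tile: {h264_per_tile_mw} mW")
print(f"core power range: {core_range_w[0]:.2f} W to {core_range_w[1]:.2f} W")
```

The h.264 workload works out to 187.5 mW per tile, which falls near the low end of the quoted per-tile range.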

41 Tile Processor Block Diagram: A Complete System on a Chip. [Figure labels. Chip I/O: PCIe 0 MAC PHY, PCIe 1 MAC PHY, XAUI MAC PHY 0, XAUI MAC PHY 1, Serdes, GbE 0, GbE 1, Flexible IO, UART, HPI, JTAG, I2C, SPI, DDR2 Memory Controllers 0 to 3. Tile: PROCESSOR (P0, P1, P2, Reg File); CACHE (L2 CACHE, L1I, L1D, ITLB, DTLB, 2D DMA); SWITCH (STN, MDN, TDN, UDN, IDN)]

42 Tile Processor NoC [IEEE Micro Sep 2007]. 5 independent non-blocking networks – 64 switches per network – 1 Terabit/sec per Tile. Each network switch directly and independently connected to tiles. One hop per clock on all networks. All accesses can be performed simultaneously on non-blocking networks. Examples shown: I/O write, Memory write, Tile to Tile access. [Figure labels: Tiles, UDN, STN, IDN, MDN, TDN, VDN]
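With one hop per clock, the network transit time between two tiles is set by the number of switches crossed. The sketch below counts hops on a mesh under two assumptions that the slide does not state: the 64 switches are arranged as an 8 by 8 grid, and packets use dimension-ordered (X first, then Y) routing. The function name `xy_route` is hypothetical.

```python
def xy_route(src, dst):
    """Return the (x, y) switches visited from src to dst, moving in X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


# Corner to corner on the assumed 8x8 mesh: the worst case.
path = xy_route((0, 0), (7, 7))
hops = len(path) - 1
print(f"{hops} hops, so about {hops} clocks of network transit at one hop per clock")
```

Under these assumptions the worst-case distance grows with the mesh edge length rather than the tile count, which is one reason meshes are attractive as core counts rise.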

43 Multicore Hardwall Implementation, Or Protection and Interconnects. [Figure labels: partitions OS1/APP1, OS1/APP3, OS2/APP2; Switch, data, valid, HARDWALL_ENABLE]

44 Product Reality Differences. Market forces – Need crisper answer to “who cares” – SMP Linux programming with pthreads – fully cache coherent – C + API approach to streaming vs new language StreamIt in Raw – Special instructions for video, networking – Floating point needed in research project, but not in product for embedded market. Lessons from Raw – E.g., Dynamic network for streams – HW instruction cache – Protected interconnects. More substantial engineering – 3-way VLIW CPU, subword arithmetic – Engineering for clock speed and power efficiency – Completeness – I/O interfaces on chip – complete system chip. Just add DRAM for system – Support for virtual memory, 2D DMA – Runs SMP Linux (can run multiple OSes simultaneously)

45 Virtual reality Simulator reality Prototype reality Product reality

46 What Does the Future Look Like? Corollary of Moore’s law: Number of cores will double every 18 months. (Cores minimally big enough to run a self respecting OS!) 1K cores by 2014! Are we ready? [Chart: core counts over time for Research and Industry; surviving axis labels ‘02, ‘05, ‘08, ‘11; surviving data label 16]
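The doubling rule on this slide can be turned into a small projection. The sketch below assumes a starting point of 64 cores in 2007, taken from the 64-tile product and its 2007 citations earlier in the talk. That starting point and the function names are assumptions, since the chart's own data points did not survive extraction.

```python
import math


def projected_cores(start_cores, start_year, year, doubling_years=1.5):
    """Core count if it doubles every `doubling_years` from the starting point."""
    return start_cores * 2 ** ((year - start_year) / doubling_years)


def year_reaching(target_cores, start_cores, start_year, doubling_years=1.5):
    """Year in which the projection first reaches `target_cores`."""
    return start_year + doubling_years * math.log2(target_cores / start_cores)


# Assumed starting point: 64 cores in 2007.
print(year_reaching(1024, 64, 2007))
print(projected_cores(64, 2007, 2014))
```

Going from 64 to 1024 cores takes four doublings, or six years at 18 months each, which is consistent with the slide's "1K cores by 2014" under this starting assumption.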

47 Vision for the Future. The ‘core’ is the logic gate of the 21st century. [Figure: a very large grid of tiles, each labeled p, m, s]

48 Research Challenges for 1K Cores 4-16 cores not interesting. Industry is there. University must focus on “1K cores”; Everything will change! Can we use 4 cores to get 2X through DILP? Remember cores will be 1GHz and simple! What is the interconnect? How should we program 1K cores? Can interconnect help with programming? Locality and reliability WILL matter for 1K cores. Spatial view of multicore? Can we add architectural support for programming ease? E.g., suppose I told you cores are free. Can you discover mechanisms to make programming easier? What is the right grain size for a core? How must our computational models change in the face of small memories per core? How to “feed the beast”? I/O and external memory bandwidth Can we assume perfect reliability any longer?

49 ATAC Architecture. Optical Broadcast WDM Interconnect. Electrical Mesh Interconnect (EMesh). [Figure: grid of tiles, each labeled p and m, with a switch] [ Proc. BARC Jan 2007, MIT- CSAIL - TR ]

50 Research Challenges for 1K Cores 4-16 cores not interesting. Industry is there. University must focus on “1K cores”; Everything will change! Can we use 4 cores to get 2X through DILP? What is the interconnect? How should we program 1K cores? Can interconnect help with programming? Locality and reliability WILL matter for 1K cores. Spatial view of multicore? Can we add architectural support for programming ease? E.g., suppose I told you cores are free. Can you discover mechanisms to make programming easier? What is the right grain size for a core? How must our computational models change in the face of small memories per core? How to “feed the beast”? I/O and external memory bandwidth Can we assume perfect reliability any longer?

51 FOS – Factored Operating System [OS Review 2008]. The key idea: space sharing replaces time sharing. Today: User app and OS kernel thrash each other in a core’s cache. User/OS time sharing is inefficient. Angstrom: OS assumes abstracted space model. OS services bound to distinct cores, separate from user cores. OS service cores collaborate to achieve best resource management. User/OS space sharing is efficient. OS cores collaborate, inspired by distributed internet services model. [Figure labels: OS, File System, FS, User App, I/O, page, "Need new page"]

52 Research Challenges for 1K Cores 4-16 cores not interesting. Industry is there. University must focus on “1K cores”; Everything will change! Can we use 4 cores to get 2X through DILP? What is the interconnect? How should we program 1K cores? Can interconnect help with programming? Locality and reliability WILL matter for 1K cores. Spatial view of multicore? Can we add architectural support for programming ease? E.g., suppose I told you cores are free. Can you discover mechanisms to make programming easier? What is the right grain size for a core? How must our computational models change in the face of small memories per core? How to “feed the beast”? I/O and external memory bandwidth Can we assume perfect reliability any longer?

53 The following are trademarks of Tilera Corporation: Tilera, the Tilera Logo, Tile Processor, TILE64, Embedding Multicore, Multicore Development Environment, Gentle Slope Programming, iLib, iMesh and Multicore Hardwall. All other trademarks and/or registered trademarks are the property of their respective owners.