Tile Processors: Many-Core for Embedded and Cloud Computing

1 Tile Processors: Many-Core for Embedded and Cloud Computing
Richard Schooler, VP Software Engineering, Tilera Corporation
Everyone is going multicore. Putting a few cores together is easy to implement. The challenge is how to keep scaling multicore chips to provide performance, performance per watt, and performance per dollar. The old way of doing things does not work; we need a new way. That means powerful, converged, low-power cores; a way to connect them all together that scales; and standard programming. This is what we focus on at Tilera: powerful yet power-efficient cores (not big cores, but cores that follow the kill rule) with very high performance and very high performance per watt; many cores together on a chip, utilizing every possible bit of die space; and standard programming.

2 Exploiting Natural Parallelism
High-performance applications have lots of parallelism!
Embedded apps. Networking: packets, flows. Media: streams, images, functional & data parallelism.
Cloud apps. Many clients: network sessions. Data mining: distributed data & computation.
Lots of different levels: SIMD (fine-grain data parallelism); thread/process (medium-grain task parallelism); distributed system (coarse-grain job parallelism).

3 Everyone is going many-core, but can the architecture scale?
The computing world is ready for radical change. [Chart: #cores (1, 2, 4, 32) versus time (2005, 2006, 2010, 2014, 2020), with points labelled Sun, IBM Cell, Larrabee, and Intel, and an annotated "Performance & Performance/W Gap".]

4 Key “Many-Core” Challenges: The 3 P’s
Performance challenge: how to scale from 1 to 1000 cores. The number of cores is the new megahertz.
Power efficiency challenge: performance per watt is the new metric. Systems are often constrained by power & cooling.
Programming challenge: how to provide a converged many-core solution in a standard programming environment.

5 Current technologies fail to deliver
“Problems cannot be solved by the same level of thinking that created them.”
Current technologies fail to deliver: incremental performance increases, high power, low level of integration, increasingly bigger cores.
We need new thinking to get 10x performance, 10x performance per watt, converged computing, and standard programming models.

6 Stepping Back: How Did We Get Here?
Moore’s Conundrum: more devices =>? more performance.
Old answers: more complex cores; bigger caches. But power-hungry.
New answers: more cores. But do conventional approaches scale? Diminishing returns!

7 The Old Challenge: CPU-on-a-chip
20 MIPS CPU in 1987 Few thousand gates

8 The Opportunity: Billions of Transistors
Old CPU: What to do with all those transistors?

9 Take Inspiration from ASICs
ASICs have high performance and low power: custom-routed, short wires; lots of ALUs, registers, and memories; huge on-chip parallelism. But how to build a programmable chip?
Notes: The first point relates to free gates. The second point relates to wire delays. ASICs are not general purpose (they cannot run sequential apps using ILP), and are not even programmable.
Let’s look at how we want to use the tiles. Down on the bottom, we have these white boxes. These correspond to programs. We have a 4-way threaded Java application running on four tiles. We have an MPI program running on two tiles, and an HTTP daemon running on one tile. The tile in the corner isn’t running anything and is in low-power mode.
The eight tiles at the top are being employed in a more novel usage. We’re streaming some video data in over the pins, performing some sort of filter operation, and then streaming it out to a screen. What we want to do is have the eight tiles work together, parallelizing the computation. We want to customize each tile to perform a certain computation, connect the tiles over the network, and place them in a fashion that minimizes congestion. This is very much like the process of designing a customized hardware circuit.
To make all of this work, we need a really fast network. Some multiprocessor networks for MPI programs have latencies on the order of 1000 cycles. The really state-of-the-art multiprocessor networks have communication latency on the order of 30 cycles. We’re going to do it in 3 cycles.

10 Replace Long Wires with Routed Interconnect
We can replace the crossbar with a point-to-point, pipelined, routed network. [IEEE Computer ’97]

11 From Centralized Clump of CPUs …
[Diagram labels: ALU, bypass net, RF.] One idea we can try to exploit here is locality.

12 … To Distributed ALUs, Routed Bypass Network
Scalar Operand Network (SON) [TPDS 2005]

13 From a Large Centralized Cache…

14 …to a Distributed Shared Cache
[Diagram labels: ALU, R, $.] [ISCA 1999]

15 Distributed Everything + Routed Interconnect → Tiled Multicore
Each tile is a processor, so programmable. [Diagram labels: ALU, R, $ in each tile.]

16 Tiled Multicore Captures ASIC Benefits and is Programmable
Scales to large numbers of cores.
Modular: design and verify 1 tile.
Power efficient: short wires plus locality optimize CV²f; Chandrakasan effect: more cores at lower frequency and voltage reduce CV²f.
Processor core + switch (S) = tile. [Diagram comparison: current bus architecture.]
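The CV²f remark on this slide refers to the usual dynamic switching power relation, P = C·V²·f. A minimal sketch of the "more cores at lower frequency and voltage" argument follows. The capacitance, voltages, and frequencies are hypothetical values chosen only for illustration (they are not Tilera figures), and the sketch assumes the workload splits perfectly across the two slower cores.

```python
def dynamic_power(c_eff_farads, voltage, freq_hz):
    """Dynamic switching power in watts: P = C * V^2 * f."""
    return c_eff_farads * voltage ** 2 * freq_hz

# Hypothetical effective switched capacitance per core (assumption).
C_EFF = 1e-9

# One core at 2 GHz and 1.2 V (assumed operating point).
one_fast = dynamic_power(C_EFF, 1.2, 2.0e9)

# Two cores at 1 GHz each; the lower clock is assumed to allow 0.9 V.
two_slow = 2 * dynamic_power(C_EFF, 0.9, 1.0e9)

# Both configurations deliver 2e9 core-cycles per second in aggregate.
savings = 1 - two_slow / one_fast
print(f"one fast core: {one_fast:.2f} W, two slow cores: {two_slow:.2f} W")
print(f"dynamic power saved: {savings:.0%}")
```

The saving comes entirely from the squared voltage term: halving the frequency and doubling the core count cancel out, so the benefit depends on how far the supply voltage can actually drop at the lower clock.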

17 Tilera processor portfolio Demonstrating the scale of many-core
[Roadmap chart, 2007, 2008, 2009, 2010 and beyond: TILE64, TILEPro36, and TILEPro64, followed by the TILE-Gx family.]
TILE-Gx16 (16 cores) and TILE-Gx36 (36 cores): 2x the performance.
TILE-Gx64 (64 cores) and TILE-Gx100 (100 cores): up to 8x performance.

18 TILE-Gx100™: Complete System-on-a-Chip with 100 64-bit cores
1.2 GHz to 1.5 GHz. 32 MBytes total cache. 546 Gbps peak memory bandwidth. 200 Tbps iMesh bandwidth.
Gbps packet I/O: 8 ports XAUI / 2 XAUI; 2 40Gb Interlaken; 32 ports 1GbE (SGMII).
80 Gbps PCIe I/O. 3 StreamIO ports (20Gb).
Wire-speed packet engine: 120 Mpps.
MiCA engines: 40 Gbps crypto; compress & decompress.
[Block diagram labels: four DDR3 memory controllers; mPIPE; two MiCA engines; two Interlaken blocks; SerDes blocks, each 10 GbE XAUI or 4x GbE SGMII; PCIe 2.0 (two 8-lane, one 4-lane); flexible I/O: UART x2, USB x2, JTAG, I2C, SPI.]

19 TILE-Gx36™: Scaling to a broad range of applications
36 processor cores. 866 MHz, 1.2 GHz, or 1.5 GHz clock. 12 MBytes total cache.
40 Gbps total packet I/O: 4 ports 10GbE (XAUI); 16 ports 1GbE (SGMII).
48 Gbps PCIe I/O. 2 16 Gbps Stream IO ports.
Wire-speed packet engine: 60 Mpps.
MiCA engine: 20 Gbps crypto; compress & decompress.
Flexible I/O: UART x2, USB x2, JTAG, I2C, SPI.
[Block diagram labels: two DDR3 memory controllers; MiCA; mPIPE; SerDes blocks, each 10 GbE XAUI or 4x GbE SGMII; PCIe 2.0 8-lane and 4-lane.]
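The "wire-speed" packet rates quoted for the two parts can be sanity-checked against Ethernet framing. The sketch below assumes the worst case of minimum-size 64-byte frames, each carrying 20 extra bytes on the wire (preamble, start-of-frame delimiter, and inter-frame gap). The function name `max_pps` is made up for this example, and the 80 Gbps figure for the TILE-Gx100 is derived here from its 8 XAUI ports at 10 Gbps each rather than stated on the slide.

```python
ETH_MIN_FRAME_BYTES = 64      # minimum Ethernet frame, including FCS
ETH_WIRE_OVERHEAD_BYTES = 20  # 7 preamble + 1 SFD + 12 inter-frame gap

def max_pps(link_gbps, frame_bytes=ETH_MIN_FRAME_BYTES):
    """Upper bound on packets per second for a given aggregate link rate."""
    bits_on_wire = (frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8
    return link_gbps * 1e9 / bits_on_wire

# TILE-Gx36: 40 Gbps total packet I/O (from the slide).
gx36_mpps = max_pps(40) / 1e6

# TILE-Gx100: 8 XAUI ports x 10 Gbps = 80 Gbps (derived, an assumption).
gx100_mpps = max_pps(8 * 10) / 1e6

print(f"Gx36 bound:  {gx36_mpps:.1f} Mpps")
print(f"Gx100 bound: {gx100_mpps:.1f} Mpps")
```

Under these assumptions the bounds come out at roughly 59.5 and 119 Mpps, which is consistent with the 60 Mpps and 120 Mpps packet-engine figures on the slides.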

20 Full-Featured General Converged Cores
Each core is a complete computer.
Processor: 3-way VLIW CPU; SIMD instructions with 32-, 16-, and 8-bit ops; instructions for video (e.g., SAD) and networking; protection and interrupts.
Memory: L1 cache and L2 cache; virtual and physical address space; instruction and data TLBs; cache-integrated 2D DMA engine.
Runs SMP Linux. Runs off-the-shelf C/C++ programs. Signal processing and general apps.
[Core diagram labels: register file; three execution pipelines; 16K L1-I; 8K L1-D; 64K L2; I-TLB; D-TLB; 2D DMA; terabit switch.]
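SAD, the video instruction named on this slide, stands for sum of absolute differences, a block-matching metric used in motion estimation. The sketch below shows only what the operation computes, in scalar Python; it does not model the Tile instruction itself, and the function name and sample pixel values are made up for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-length pixel blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

# Two hypothetical rows of four 8-bit pixels.
reference = bytes([10, 20, 30, 40])
candidate = bytes([12, 18, 30, 45])

print(sad(reference, candidate))  # 2 + 2 + 0 + 5 = 9
```

A SIMD version would apply the same subtract, absolute-value, and accumulate steps to several 8-bit lanes packed into one register per instruction, which is why the operation suits hardware with 8-bit SIMD support.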

21 Software must complement the hardware Enable re-use of existing code-bases
Standards-based development environment: e.g. gcc, C, C++, Java, Linux; comprehensive command-line & GUI-based tools.
Support multiple OS models: one OS running SMP; multiple virtualized OS’s with protection; bare metal or “zero-overhead” with background OS environment.
Support range of parallel programming styles: threaded programming (pthreads, TBB); run-to-completion with load-balancing; decomposition & pipelining; higher-level frameworks (Erlang, OpenMP, Hadoop etc.).
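Two of the programming styles listed above, run-to-completion with load balancing and decomposition into a pipeline, can be contrasted in a small sketch. It uses Python threads purely to show the structure; the `parse` and `classify` stages and the sample packets are hypothetical, and on CPython the threads illustrate the shape of each style rather than a real speedup.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def parse(packet):
    """Stage 1 (hypothetical): normalize the raw packet text."""
    return packet.strip().lower()

def classify(parsed):
    """Stage 2 (hypothetical): tag the packet by its first token."""
    kind = "http" if parsed.startswith("get") else "other"
    return (kind, parsed)

def run_to_completion(packets, workers=4):
    """Each worker takes a whole packet through every stage."""
    def handle(packet):
        return classify(parse(packet))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The pool balances load; map() returns results in input order.
        return list(pool.map(handle, packets))

def pipelined(packets):
    """Each stage runs on its own thread; packets flow through queues."""
    stop = object()  # sentinel marking the end of the stream
    to_parse, to_classify = queue.Queue(), queue.Queue()
    results = []

    def parse_stage():
        while (item := to_parse.get()) is not stop:
            to_classify.put(parse(item))
        to_classify.put(stop)

    def classify_stage():
        while (item := to_classify.get()) is not stop:
            results.append(classify(item))

    threads = [threading.Thread(target=parse_stage),
               threading.Thread(target=classify_stage)]
    for t in threads:
        t.start()
    for packet in packets:
        to_parse.put(packet)
    to_parse.put(stop)
    for t in threads:
        t.join()
    return results

packets = ["GET /index.html ", "PING", " get /img.png"]
print(run_to_completion(packets))
print(pipelined(packets))
```

Run-to-completion keeps each packet's state on one worker, while the pipeline keeps each stage's code and data on one worker; which fits better depends on where the locality is.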

22 Software Roadmap
Standards & open source integration.
Compiler: gcc, g++.
Linux. Kernel: Tile architecture integrated to. User-space: glibc, broader set of standard packages.
Extended programming and runtime environments. Java: porting OpenJDK. Virtualization: porting KVM.

23 Tile architecture: The future of many-core computing
Multicore is the way forward, but we need the right architecture to utilize it.
The Tile architecture addresses the challenges: scales to 100’s of cores; delivers very low power; runs your existing code.
Standards-based software: familiar tools; full range of standard programming environments.

24 Thank you! Questions?

25 Research Vision to Commercial Product
[Timeline figure. 1996: the opportunity, 1B transistors in 2007. 1997: a blank slate (CPU, Mem). 2002: MIT Raw, 16 cores. 2007: Tile Processor, 64 cores. 2010: TILE-Gx100, 100 cores. 2018: the future? 100B transistors.]

26 Standard tools and programming model
Multicore Development Environment.
Standards-based tools. Standard programming: SMP Linux 2.6; ANSI C/C++; Java, PHP. Integrated tools: GCC compiler; standard gdb; gprof; Eclipse IDE. Innovative tools: multicore debug; multicore profile.
Standard application stack. Application layer: open source apps; standard C/C++ libs. Operating System layer: 64-way SMP Linux; Zero Overhead Linux; bare metal environment. Hypervisor layer: virtualizes hardware; I/O device drivers.
[Stack diagram, top to bottom: applications; application libraries; operating system kernel and drivers; hypervisor (virtualization and high-speed I/O drivers); Tile Processor (tiles).]

27 Standard Software Stack
Management protocols: IPMI 2.0.
Infrastructure apps: network monitoring, transcoding.
Language support: Perl.
Compiler, OS, hypervisor: commercial Linux distribution; gcc & g++.

28 High single core performance Comparable to Atom & ARM Cortex-A9 cores
Data for TILEPro, ARM Cortex-A9, and Atom N270 is available on the CoreMark website. TILE-Gx and single-thread Atom results were measured in Tilera labs. The single-core, single-thread result for ARM is calculated from chip scores.

29 Significant value across multiple markets
Networking: classification, L4-7 services, load balancing, monitoring/QoS, security.
Multimedia: video conferencing, media streaming, transcoding.
Wireless: base station, media gateway, service nodes, test equipment.
Cloud: Apache, Memcached, web applications, LAMP stack.
High performance. Low power. Standard programming.
Over 100 customers. Over 40 customers going into production. Tier 1 customers in all target markets.
Notes: With these capabilities, Tilera enables many applications in key vertical market segments, from networking to multimedia to wireless and cloud. The stark reality is that current processor technology is limiting the deployment of many applications integral to Web 2.0, security, multimedia, and wireless connectivity. Many applications demand much more scalable performance, but power and size are key challenges facing many system developers. No more!

30 Targeting markets with highly parallel applications
Web: web serving, in-memory cache, data mining.
Media delivery: transcoding, video delivery, wireless media.
Government: lawful interception, surveillance, other.
Common themes: hundreds and thousands of servers running each application; thousands of parallel transactions; all need better performance and power efficiency.

31 The Tile Processor Architecture: Mesh interconnect, power-optimized cores
Processor core + switch (S) = tile.
2-dimensional mesh network with 200 Tbps on-chip bandwidth.
Scales to large numbers of cores.
Modular: design and verify 1 tile.
Power efficient: short wires & locality optimize CV²f; Chandrakasan effect: more cores at lower frequency and voltage reduce CV²f.
[Diagram comparison: traditional bus/ring architecture.]
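To make the 2-dimensional mesh concrete, here is a minimal sketch of how a message might travel between tiles. It assumes dimension-ordered (X-then-Y) routing and an 8x8 grid; the slide states neither, so both are assumptions, and the function names are made up for this example.

```python
def xy_route(src, dst):
    """Tiles visited when routing along X first, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def hop_count(src, dst):
    """Switch-to-switch hops: the Manhattan distance on the mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

MESH_SIDE = 8  # assumed 8x8 layout for a 64-tile chip

print(xy_route((0, 0), (2, 1)))
print("worst case hops:", hop_count((0, 0), (MESH_SIDE - 1, MESH_SIDE - 1)))
```

Every link in the path connects neighbouring tiles, so wires stay short no matter how many tiles are added; the cost of scaling shows up as extra hops (growing with the side of the grid) rather than as longer wires or a shared bus.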

32 Distributed “everything” Cache, memory management, connectivity
Big centralized caches don’t scale: contention, long latency, high power.
Distributed caches have numerous benefits: lower power (less logic lit up per access); exploit locality (local L1 & L2); can exploit various cache placement mechanisms to enhance performance.
Completely HW-coherent cache system. Global 64-bit address space.

33 Highest compute density
2U form factor. 4 hot-pluggable modules. 8 Tilera TILEPro processors. 512 general-purpose cores. 1.3 trillion operations/sec.

34 Power efficient and eco-friendly server
10,000 cores in an 8-kilowatt rack. 35-50 watts max per node. Server power of 400 watts. 90%+ efficient power supplies. Shared fans and power supplies.
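The rack-level figures can be cross-checked against the per-server figures from the previous slide (8 processors and 512 cores per 2U server). The sketch below assumes the whole 8 kW budget goes to identical 400 W servers and that a "node" is one processor; neither is stated explicitly on the slides.

```python
PROCESSORS_PER_SERVER = 8   # from the compute-density slide
CORES_PER_SERVER = 512      # from the compute-density slide
SERVER_WATTS = 400
RACK_WATTS = 8000
NODE_WATTS_MAX = 50         # upper end of "35-50 watts max per node"

cores_per_processor = CORES_PER_SERVER // PROCESSORS_PER_SERVER
servers_per_rack = RACK_WATTS // SERVER_WATTS
cores_per_rack = servers_per_rack * CORES_PER_SERVER
watts_per_core = SERVER_WATTS / CORES_PER_SERVER

print(f"{cores_per_processor} cores per processor")
print(f"{servers_per_rack} servers and {cores_per_rack} cores per rack")
print(f"{watts_per_core:.2f} W per core at the server level")
print(f"{PROCESSORS_PER_SERVER * NODE_WATTS_MAX} W if all nodes hit their max")
```

Under these assumptions the numbers line up: 20 servers give 10,240 cores, matching the "10,000 cores" claim, and eight nodes at 50 W each account for the 400 W server figure.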

35 Coherent distributed cache system
Globally shared physical address space: full hardware cache coherence; standard shared-memory programming model.
Distributed cache: each tile has local L1 and L2 caches; the aggregate of L2 serves as a globally shared L3; any cache block can be replicated locally; hardware tracks sharers and invalidates stale copies.
Dynamic Distributed Cache (DDC™): memory pages distributed across all cores or homed by allocating core.
Coherent I/O: hardware maintains coherence; I/O reads/writes coherent with tile caches; reads/writes delivered by HW to home cache; header/packet delivered directly to tile caches.
Completely coherent system.
[Block diagram labels: DDR2 controllers 0 to 3; PCIe 0 and 1; XAUI 0 and 1; GbE 0 and 1; SerDes; flexible I/O; UART, JTAG, SPI, I2C.]
Notes: first gen, then second-gen comparison at the end.
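The behaviour described above (local replication, a home that tracks sharers, invalidation of stale copies) can be illustrated with a toy model. This is not the DDC protocol, whose details the slide does not give: the class name, the write-through-to-home policy, and the modulo mapping of lines to home tiles are all assumptions made for the sketch.

```python
class ToyCoherentCache:
    """Toy model: per-tile local copies plus sharer tracking at the home."""

    def __init__(self, num_tiles, line_bytes=64):
        self.num_tiles = num_tiles
        self.line_bytes = line_bytes
        self.home_data = {}                        # addr -> value at home
        self.sharers = {}                          # addr -> set of tile ids
        self.local = [dict() for _ in range(num_tiles)]

    def home_of(self, addr):
        """Spread lines across tiles (assumed modulo mapping)."""
        return (addr // self.line_bytes) % self.num_tiles

    def read(self, tile, addr):
        """Replicate the line locally on a miss and record the sharer."""
        if addr not in self.local[tile]:
            self.local[tile][addr] = self.home_data.get(addr, 0)
            self.sharers.setdefault(addr, set()).add(tile)
        return self.local[tile][addr]

    def write(self, tile, addr, value):
        """Invalidate other sharers' copies, then update home and writer."""
        for other in self.sharers.get(addr, set()) - {tile}:
            self.local[other].pop(addr, None)
        self.home_data[addr] = value
        self.local[tile][addr] = value
        self.sharers[addr] = {tile}

cache = ToyCoherentCache(num_tiles=4)
cache.read(0, 128)          # tile 0 replicates the line
cache.read(1, 128)          # tile 1 replicates it too
cache.write(2, 128, 7)      # tiles 0 and 1 lose their stale copies
print(cache.read(0, 128))   # tile 0 re-fetches the new value
```

The point of the model is that software on every tile sees ordinary shared memory: the bookkeeping that keeps copies consistent lives in the `write` path, which on the real chip is done by hardware.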
