Presentation on theme: "Tile Processors: Many-Core for Embedded and Cloud Computing"— Presentation transcript:
1Tile Processors: Many-Core for Embedded and Cloud Computing Everyone is going multicore.Putting a few cores together are easy to implement.The challenge is how to continue scaling the multicore chips to provide performance, performance per watt and performance per $The old way of doing thigs does not work we need a new way of doing thingsPowerfull, converged, and Low power coresA way to provide to connect them all together to scaleMaking sure they are standard programmingThis is what we focus on at Tilera:Powerfull yet power efficient cores:Not big cores but follow the kill ruleVery high performanceVery high performance per wattAdding many cores together on a chipUtilize every possible die spaceStandard programmingRichard Schooler VP Software Engineering Tilera Corporation1
2Exploiting Natural Parallelism High-performance applications have lots of parallelism!Embedded apps:Networking: packets, flowsMedia: streams, images, functional & data parallelismCloud apps:Many clients: network sessionsData mining: distributed data & computationLots of different levels:SIMD (fine-grain data parallelism)Thread/process (medium-grain task parallelism)Distributed system (coarse-grain job parallelism)
3Every one is going Manycore, but can the architecture scale? CoresThe computing world is ready for radical changePerformance& Performance/WGap32SunIBM CellLarrabee#CoresIntelPerformance421Time20052006201020142020
4Key “Many-Core” Challenges: The 3 P’s Performance challengeHow to scale from 1 to 1000 cores – the number of cores is the new MegahertzPower efficiency challengePerformance per watt is the new metric – systems are often constrained by power & coolingProgramming challengeHow to provide a converged many core solution in a standard programming environment
5Current technologies fail to deliver “Problems cannot be solved by the same level of thinking that created them.”Current technologies fail to deliverIncremental performance increaseHigh powerLow level of IntegrationIncreasingly bigger coresWe need to have a new thinking to get10 x performance10 x performance per wattConverged computingStandard programming models
6Stepping Back: How Did We Get Here? Moore’s Conundrum: More devices =>? More performanceOld answers: More complex cores; bigger cachesBut power-hungryNew answers: More coresBut do conventional approaches scale?Diminishing returns!
7The Old Challenge: CPU-on-a-chip 20 MIPS CPUin 1987Few thousand gates
8The Opportunity: Billions of Transistors Old CPU:What to do with all those transistors?
9Take Inspiration from ASICs memmemThe first point relates to free gatesThe second point relates to wire delaysNot general purpose (cannot run seq apps using ilp), and not even programmableLet’s look at how we want to use the tiles.Down on the bottom, we have these white boxes. These correspond to programs. We have a 4-way threaded JAVAapplication running on four tiles. We have an MPI program running on two tiles, and a http daemon running on one tile.The tile in the corner isn’t running anything and is in low-power mode.The eight tiles at the top are being employed in a more novel usage. We’re streaming in some video data in overthe pins, performing some sort of filter operation, and then streaming it out to a screen. What we want to do is havethe eight tiles work together, parallelizing the computation. We want to customize each tile to perform a certain computation,we want to connect the tiles over the network, and place them in a fashion to minimize congestion. This is very muchlike the process of a designing a customized hardware circuit.In order to make all of this work, we need to have a really fast network to do this. Some multi-processor networks forMPI programs have latencies on the order of a 1000 cycles. The really state-of-the-art multiprocessor networks havecommunication latency on the order of 30 cycles. We’re going to do it in 3 cycles.ASICs have high performance and low powerCustom-routed, short wiresLots of ALUs, registers, memories – huge on-chip parallelismBut how to build a programmable chip?
10Replace Long Wires with Routed Interconnect CtrlWe can replace the crossbar with a point-to-point, pipelined, routed network.[IEEE Computer ’97]
11From Centralized Clump of CPUs … ALUBypass NetRFOne idea we can try to exploit here is locality.
12… To Distributed ALUs, Routed Bypass Network We can replace the crossbar with a point-to-point, pipelined, routed network.Scalar Operand Network (SON) [TPDS 2005]
13From a Large Centralized Cache… We can replace the crossbar with a point-to-point, pipelined, routed network.
14…to a Distributed Shared Cache ALUR$We can replace the crossbar with a point-to-point, pipelined, routed network.[ISCA 1999]
15Distributed Everything + Routed Interconnect Tiled Multicore $ALUALUALURALUWe can replace the crossbar with a point-to-point, pipelined, routed network.ALUALUEach tile is a processor, so programmable
16Tiled Multicore Captures ASIC Benefits and is Programmable Scales to large numbers of coresModular – design and verify 1 tilePower efficientShort wires plus locality opts – CV2fChandrakasan effect, more cores at lower freq and voltage – CV2fProcessor Core= TileCore+ SwitchSCurrent Bus Architecture
17Tilera processor portfolio Demonstrating the scale of many-core TILE-Gx100100 coresGx64 & Gx100Up to 8x performanceTILE-Gx6464 coresTILEPro64TILE-Gx3636 coresTILE64Gx16 & Gx362x the performanceTILE-Gx1616 coresTILEPro362007200820092010. . .
19Memory Controller (DDR3) TILE-Gx36™: Scaling to a broad range of applications36 Processor Cores866M, 1.2GHz, 1.5GHz clk12 MBytes total cache40 Gbps total packet I/O4 ports 10GbE (XAUI)16 ports 1GbE (SGMII)48 Gbps PCIe I/O2 16Gbps Stream IO portsWire-speed packet engine60MppsMiCA engine:20 Gbps cryptoCompress & decompressFlexibleI/OUARTx2,USBx2,JTAG,I2C, SPIMiCAmPIPE10 GbEXAUISerDes4x GbESGMIIPCIe 2.08-lane4-laneMemory Controller (DDR3)
20Full-Featured General Converged Cores ProcessorEach core is a complete computer3-way VLIW CPUSIMD instructions: 32, 16, and 8-bit opsInstructions for video (e.g., SAD) and networkingProtection and interruptsMemoryL1 cache and L2 CacheVirtual and physical address spaceInstruction and data TLBsCache integrated 2D DMA engineRuns SMP LinuxRuns off-the-shelf C/C++ programsSignal processing and general appsCoreCache16K L1-I64K L2I-TLBD-TLB2D DMA8K L1-DRegister FileThree Execution PipelinesTerabitSwitch
21Software must complement the hardware Enable re-use of existing code-bases Standards-based development environmente.g. gcc, C, C++, Java, LinuxComprehensive command-line & GUI-based toolsSupport multiple OS modelsOne OS running SMPMultiple virtualized OS’s with protectionBare metal or “zero-overhead” with background OS environmentSupport range of parallel programming stylesThreaded programming (pThreads, TBB)Run-to-Completion with load-balancingDecomposition & PipeliningHigher-level frameworks (Erlang, OpenMP, Hadoop etc.)21
22Standards & open source integration Software RoadmapStandards & open source integrationCompiler: gcc, gLinux:Kernel: Tile architecture integrated toUser-space: glibc, broader set of standard packagesExtended programming and runtime environmentsJava: porting OpenJDKVirtualization: porting KVM
23Tile architecture: The future of many-core computing Multicore is the way forwardBut we need the right architecture to utilize itThe Tile architecture addresses the challengesScales to 100’s of coresDelivers very low powerRuns your existing codeStandards-based softwareFamiliar toolsFull range of standard programming environments23
28High single core performance Comparable to Atom & ARM Cortex-A9 cores Data for TILEPro, ARM Cortex-A9, Atom N270 is available on the CoreMark websiteTILE-Gx and single thread Atom results were measured in Tilera labsSingle core, single thread result for ARM is calculated based on chip scores
29Significant value across multiple markets NetworkingMultimediaWirelessCloudClassificationL4-7 ServicesLoad BalancingMonitoring/QoSSecurityVideo ConferencingMedia StreamingTranscodingBase StationMedia GatewayService NodesTest EquipmentApacheMemcachedWeb ApplicationsLAMP stackHigh PerformanceWith these capabilities, Tilera enables many applications in key vertical market segments; from Networking to Multimedia to Wireless, and Cloud.The stark reality is that current processor technology is limiting the deployment of many applications integral to Web2.0, Security, multimedia and wireless connectivity. Many applications demand much more scalable performance but power and size are key challenges facing many system developers. No more!Low PowerStandard ProgrammingOver 100 customersOver 40 customers going into productionTier 1 customers in all target markets2929
30Targeting markets with highly parallel applications Web ServingIn Memory CacheData MiningWebTranscodingVideo deliveryWireless mediaMedia deliveryLawful interceptionSurveillanceOtherGovernmentCommon ThemesHundreds and Thousands of servers running each applicationthousands of parallel transactionsAll need better performance and power efficiency
312 Dimensional mesh network 200 Tbps on-chip bandwidth The Tile Processor Architecture Mesh interconnect, power-optimized coresSProcessorCore2 Dimensional mesh networkCore + Switch = TileScales to large numbers of coresModular: Design-and-verify 1 tilePower efficient:Short wires & locality optimize CV2fChandrakasan effect, more cores at lower freq and voltage – CV2f200 Tbps on-chip bandwidthTraditional Bus/Ring Architecture
32Distributed “everything” Cache, memory management, connectivity Big centralized caches don’t scaleContentionLong latencyHigh powerDistributed caches have numerous benefitsLower power (less logic lit-up per access)Exploit locality (local L1 & L2)Can exploit various cache placement mechanisms to enhance performanceDistributed cachesCompletelyHW CoherentCacheSystemGlobal 64-bit address space
33Highest compute density 2U form factor4 hot pluggable modules8 Tilera TILEPro processors512 general purpose cores1.3 trillion operations /sec
34Power efficient and eco-friendly server 10,000 cores in a 8 Kilowatt rack35-50 watts max per nodeServer power of 400 watts90%+ efficient power suppliesShared fans and power supplies
35Coherent distributed cache system Globally Shared Physical Address SpaceFull Hardware Cache CoherenceStandard shared memory programming modelDistributed cacheEach tile has local L1 and L2 cachesAggregate of L2 serves as a globally shared L3Any cache block can be replicated locallyHardware tracks sharers, invalidates stale copiesDynamic Distributed Cache (DDC™)Memory pages distributed across all cores or homed by allocating coreCoherent I/OHardware maintains coherenceI/O reads/writes coherent with tile cachesReads/writes delivered by HW to home cacheHeader/packet delivered directly to tile cachesSerDesGbE 0GbE 1Flexible I/OUARTJTAGSPI, I2CDDR2 Controller 3DDR2 Controller 0DDR2 Controller 2DDR2 Controller 1PCIe 0PCIe 1XAUI 0XAUI 1CompletelyCoherentSystemFirst Gen then second gen comparison at the end