TILEmpower-Gx36 - Architecture overview & performance benchmarks – Presented by Younghyun Jo 2013/12/18
Computer Systems and Platforms Lab
Outline
- Architecture overview
  - Motivation
  - Specification of TILE-Gx8036 processors
- Performance evaluations
  - Computational performance evaluation
  - Memory performance evaluation
- Conclusion
Motivation of Tilera architectures
Motivation
- Dr. Anant Agarwal
  - A founder of Tilera Corp.
  - Computer architecture researcher, professor of EECS at MIT
  - Led the Alewife project and the Raw architecture project
- MIT Alewife project (1990 ~ 1999)
  - Alewife: a large-scale multiprocessor
  - Cache-coherent distributed shared memory and user-level message passing in a single integrated hardware framework
- Raw processor (1997 ~ 2007)
  - Tiled multicore architecture
  - Wire-efficient multicore architecture (interconnection between tiles)
  - Highly parallel VLSI; the compiler knows low-level details of the hardware
Motivation
- Scalar Operand Networks [IEEE TPDS]: challenges and solutions in the design of scalable scalar operand networks
  - Frequency scalability
  - Bandwidth scalability
  - Deadlock and starvation
  - Handling exceptional events
  - Efficient operation-operand matching
- Tiled multicore
  - Distributed everything + routed interconnection
  - Replace long wires with routed interconnect
  - From a centralized clump of CPUs to distributed ALUs and a routed bypass network
  - From a large centralized cache to a distributed shared cache
Specification of TILE-Gx8036 processors
- TILE-Gx cores
- DDR3 DRAM
- Rshim: boot controls, diagnostics
- TRIO: transactional I/O with DMA
- mPIPE: packet management
- MiCA: hardware accelerators (crypto & compression)
TILE-Gx8036
- Each core
  - Processor
    - 1.2 GHz
    - 64-bit addressing mode
    - 3-way VLIW CPU
  - Storage
    - 32 KB L1I / L1D cache
    - 256 KB L2 cache
- 9 MB coherent L3 cache: Dynamic Distributed Cache
Processor Pipelines
- Processor pipelines
  - The pipeline consists of six main stages: Fetch, Branch Predict, Decode, Execute 0, Execute 1, and Write Back
Processor Pipelines
- Pipeline latencies
Switch Interfaces
- IDN: Internal dynamic network
- UDN: User dynamic network
- RDN: Memory response network
- QDN: Memory request network
- SDN: Shared dynamic network
Operating system / process isolation
- Hardwall
  - Prevents unwanted communication between user applications running on adjacent tiles
  - Programmable protection bit on each output port of the UDN or STN
  - Hardwall also provides a powerful virtualization tool
Network Arbitration
- Packets requiring the same output port are blocked until the current packet has finished routing
- Arbitration is basically round robin
  - Round robin
  - Network priority round robin
- Routing algorithm
  - The X dimension is checked first
  - The Y dimension is checked next
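The routing and arbitration rules on this slide can be sketched in C. This is a minimal illustration under stated assumptions, not Tilera's switch logic: the function names `route_xy`, `hop_count`, and `rr_arbitrate`, the port enum, and the coordinate convention (x grows to the east, y grows to the south) are all hypothetical and chosen only for the example. Only plain round robin is shown; the network priority variant is not modeled.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical output ports of one tile's switch (names are illustrative). */
typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

/* Dimension-ordered (XY) routing: resolve the X offset first, then Y.
 * Assumption: x grows to the east and y grows to the south. */
port_t route_xy(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return PORT_EAST;
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_SOUTH;
    if (dst_y < cur_y) return PORT_NORTH;
    return PORT_LOCAL;  /* the packet has reached its destination tile */
}

/* Number of switch-to-switch hops under XY routing (Manhattan distance). */
int hop_count(int src_x, int src_y, int dst_x, int dst_y)
{
    return abs(dst_x - src_x) + abs(dst_y - src_y);
}

/* Plain round-robin arbitration for one output port.
 * requests: bit i set means input i wants the port.
 * last:     input that was granted most recently (-1 if none yet).
 * Returns the granted input, or -1 when nobody is requesting. */
int rr_arbitrate(unsigned requests, int n_inputs, int last)
{
    assert(n_inputs > 0 && n_inputs <= 32);
    assert(last >= -1 && last < n_inputs);
    for (int i = 1; i <= n_inputs; i++) {
        int candidate = (last + i) % n_inputs;
        if (requests & (1u << candidate))
            return candidate;
    }
    return -1;
}
```

Assuming the 36 tiles are arranged as a 6 x 6 grid, a packet from tile (0,0) to tile (5,5) would travel five hops east and then five hops south, so `hop_count(0, 0, 5, 5)` gives 10. The search in `rr_arbitrate` starts just after the last winner, which is what keeps one input from monopolizing the port.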
System Software Stack
- Layers
  - Tile Processor Hardware
  - Hypervisor
  - Supervisor: Tile Linux
  - Applications / User
- 4 different modes for tiles
  - Standard: SMP Tile Linux (2.6.38)
  - Dataplane: Zero Overhead Linux
  - Bare metal environments: user-created run-time environment
  - Dedicated: tile for debugging
Bare metal environment
- Bare Metal Environment
  - Run-time environment that allows users to run applications that require direct access to the hardware
- Abilities
  - Full access to all hardware resources
  - Install interrupt vectors
  - Virtual/physical memory allocator
  - I/O device setup
  - UDN/IDN (can also communicate with SMP Linux)
  - Libc utilities that do not depend on OS system services
Power management
- Dynamic voltage scaling (DVS) and dynamic frequency scaling (DFS) are available
- Configurable I/O and accelerator shutdowns
- Hardware-initiated zero-latency Tile sleep
- Software-initiated low-power Tile NAP mode
Multicore Development Environment
- TILEmpower-Gx
- Development environment: x86 host machine (bern.snu.ac.kr)
  - MDE 4.1 / RPM
  - Operating systems
  - Multicore profiler/debugger
  - Evaluation platforms
  - KVM, IDE, gcc, and so on
- $ tile-monitor -flags
Computational performance evaluation
Computational performance evaluation
- Benchmark scenario
  - Matrix multiplication with OpenMP
  - C (1000 by 1000) = A (1000 by 1000) x B (1000 by 1000)
- Performance
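The slide does not show the benchmark source, so the following is only a sketch of what an OpenMP matrix multiplication of this shape might look like. The function name `matmul`, the `double` element type, the row-major layout, and the loop order are assumptions made for the example, not details taken from the presentation.

```c
#include <assert.h>
#include <stddef.h>

/* C = A x B for n-by-n matrices stored in row-major order.
 * The outer loop over rows is divided among OpenMP threads; each thread
 * writes a disjoint set of rows of c, so no synchronization is needed.
 * When compiled without OpenMP support the pragma is ignored and the
 * loop simply runs serially. */
void matmul(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        size_t row = (size_t)i;
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[row * n + k] * b[k * n + j];
            c[row * n + j] = sum;
        }
    }
}
```

With GCC, OpenMP is enabled by the `-fopenmp` flag, and the thread count can be varied through the `OMP_NUM_THREADS` environment variable. For the scenario on the slide, `n` would be 1000 and the three matrices would be allocated on the heap.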
Memory performance evaluation
Memory performance for each core
- Memory access cycles for each core on ZOL (Zero Overhead Linux)
- Blue: load buffer0 in node0 / Green: load buffer1 in node1
- [Figure: grid of tiles between Memory Node 0 (Buffer 0) and Memory Node 1 (Buffer 1), with the faster row marked. Legend: the number of cycles]
Memory performance for each core
- Memory access cycles for each core on BME (Bare Metal Environment)
- Blue: load buffer0 in node0 / Green: load buffer1 in node1
- [Figure: grid of tiles between Memory Node 0 (Buffer 0) and Memory Node 1 (Buffer 1), with the faster row marked. Legend: the number of cycles]
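The slides report per-tile load latencies but not how they were measured. One common way to measure load latency is a dependent pointer chase, sketched below. Everything here is an assumption for illustration: the names `build_chain`, `chase`, and `ns_per_load` are hypothetical, the timing uses the POSIX `clock_gettime` call and reports nanoseconds rather than reading a hardware cycle counter, and the sketch does not place the buffer on a particular memory node or pin the thread to a particular tile, both of which the experiment on the slides would require.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Fill buf so that slot i holds the index of the next slot to visit.
 * With a stride that is coprime with n, the chain visits every slot
 * once before returning to the start. */
void build_chain(size_t *buf, size_t n, size_t stride)
{
    assert(buf != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        buf[i] = (i + stride) % n;
}

/* Follow the chain for the given number of loads. Each load address
 * depends on the previous load's result, so the loads cannot overlap. */
size_t chase(const size_t *buf, size_t start, size_t loads)
{
    size_t idx = start;
    for (size_t i = 0; i < loads; i++)
        idx = buf[idx];
    return idx;
}

/* Average wall-clock time per dependent load, in nanoseconds. */
double ns_per_load(const size_t *buf, size_t loads)
{
    struct timespec t0, t1;
    assert(loads > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile size_t sink = chase(buf, 0, loads);  /* keep the result live */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)loads;
}
```

To approximate the cycle counts shown in the figures, the result could be multiplied by the clock rate: at 1.2 GHz, one nanosecond corresponds to 1.2 cycles. The buffer should be much larger than the caches and the stride at least one cache line if the goal is to observe DRAM latency rather than cache hits.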
Memory controller
- Memory controller block diagram
Thank you