TILEmpower-Gx36 - Architecture overview & performance benchmarks – Presented by Younghyun Jo 2013/12/18.


2. Outline (Computer Systems and Platforms Lab)
- Architecture overview
  - Motivation
  - Specification of TILE-Gx8036 processors
- Performance evaluations
  - Computational performance evaluation
  - Memory performance evaluation
- Conclusion

3. Motivation of Tilera architectures

4. Motivation
Dr. Anant Agarwal
- A founder of Tilera Corp.
- Computer architecture researcher, professor of EECS at MIT
- Led the Alewife project and the Raw architecture project
MIT Alewife project (1990 ~ 1999)
- Alewife: a large-scale multiprocessor
- Cache-coherent distributed shared memory and user-level message passing in a single integrated hardware framework
Raw processor (1997 ~ 2007)
- Tiled multicore architecture
- Wire-efficient multicore architecture (interconnection between tiles)
- Highly parallel VLSI; the compiler knows low-level details of the hardware
[Figure label: 2002]

5. Motivation
Scalar Operand Networks [IEEE TPDS]: challenges in the design of scalable scalar operand networks, and how they are overcome
- Frequency scalability
- Bandwidth scalability
- Deadlock and starvation
- Handling exceptional events
- Efficient operation-operand matching
Tiled multicore
- Distributed everything + routed interconnection
- Replace long wires with a routed interconnect
- From a centralized clump of CPUs to distributed ALUs and a routed bypass network
- From a large centralized cache to a distributed shared cache

6. Specification of TILE-Gx8036 processors

7. [Block diagram]
- TILE-Gx cores
- DDR3 DRAM
- Rshim: boot controls, diagnostics
- TRIO: transactional I/O with DMA
- mPIPE: packet management
- MiCA: hardware accelerators (crypto & compression)

8. TILE-Gx8036
Each core
- Processor
  - 1.2 GHz
  - 64-bit addressing mode
  - 3-way VLIW CPU
- Storage
  - 32 KB L1I / L1D cache
  - 256 KB L2 cache
  - 9 MB coherent L3 cache: Dynamic Distributed Cache
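The 9 MB L3 figure is consistent with the per-tile L2 caches being pooled into one coherent, distributed cache. A quick arithmetic check, assuming 36 tiles (suggested by the part name TILE-Gx8036; the tile count is not stated on this slide):

```python
# Sanity check of the aggregate L3 size.
# Assumptions: 36 tiles (inferred from the part name, not stated on the slide)
# and an "L3" formed by pooling every tile's 256 KB L2 cache.
TILES = 36
L2_KB_PER_TILE = 256

l3_kb = TILES * L2_KB_PER_TILE
l3_mb = l3_kb / 1024
print(f"{l3_kb} KB = {l3_mb} MB")
```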

9. Processor pipelines
The pipeline consists of 6 main stages:
- Fetch, Branch Predict, Decode, Execute 0, Execute 1, and Write Back

10. Processor pipelines
Pipeline latencies [table]

11. Switch interfaces
- IDN: Internal dynamic network
- UDN: User dynamic network
- RDN: Memory response network
- QDN: Memory request network
- SDN: Shared dynamic network

12. Operating system / process isolation
Hardwall
- Prevents unwanted communication between user applications running on adjacent tiles
  - Programmable protection bit on each output port of the UDN or STN
- Hardwall also provides a powerful virtualization tool
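The hardwall idea can be pictured with a toy model: each tile holds one protection bit per output port, and setting the bits along the edge of a rectangular region stops traffic from leaving it. This is only a sketch; `Tile`, `wall_off`, and the port names are hypothetical and are not part of any Tilera API, and the model covers outgoing traffic only (a neighbouring region would set its own bits).

```python
# Toy model of a hardwall: one protection bit per output port of a tile.
# All names are hypothetical illustrations, not the Tilera API.
from dataclasses import dataclass, field

PORTS = ("north", "south", "east", "west")


@dataclass
class Tile:
    # protect[port] is True when traffic leaving through that port is blocked
    protect: dict = field(default_factory=lambda: {p: False for p in PORTS})

    def can_send(self, port: str) -> bool:
        return not self.protect[port]


def wall_off(grid, x0, y0, x1, y1):
    """Set protection bits on the outer edge of the rectangle [x0..x1] x [y0..y1].

    grid[y][x] is a Tile; y grows toward the south, x toward the east.
    """
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            tile = grid[y][x]
            if x == x0:
                tile.protect["west"] = True
            if x == x1:
                tile.protect["east"] = True
            if y == y0:
                tile.protect["north"] = True
            if y == y1:
                tile.protect["south"] = True


grid = [[Tile() for _ in range(4)] for _ in range(4)]
wall_off(grid, 0, 0, 1, 1)  # isolate the 2x2 region in the top-left corner
```

Inside the region, tiles can still talk to each other: `grid[0][0].can_send("east")` stays true, while `grid[0][1].can_send("east")` becomes false because that port faces outside the wall.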

13. Network arbitration
- Packets requiring the same output port are blocked until the current packet has finished routing
- Arbitration is basically round robin
  - Round robin
  - Network priority round robin
- Routing algorithm
  - The X dimension is checked first
  - The Y dimension is checked next
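The two mechanisms on this slide can be sketched in a few lines: dimension-ordered routing that resolves X before Y, and a plain round-robin arbiter for one output port. The function and class names are illustrative assumptions, and the "network priority" variant is not modelled.

```python
def xy_route(src, dst):
    """Return the (x, y) hops from src to dst, resolving the X dimension first."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:          # move along X until the column matches
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:          # then move along Y
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


class RoundRobinArbiter:
    """Grants one output port to one requesting input at a time, rotating priority."""

    def __init__(self, n_inputs):
        self.n = n_inputs
        self.last = n_inputs - 1   # so that input 0 has priority on the first grant

    def grant(self, requests):
        """requests: set of input indices that want the port. Returns the winner or None."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if candidate in requests:
                self.last = candidate
                return candidate
        return None


route = xy_route((0, 0), (2, 1))
arbiter = RoundRobinArbiter(4)
grants = [arbiter.grant({0, 2}) for _ in range(3)]
```

With inputs 0 and 2 both requesting the port repeatedly, the grants alternate between them, so neither input is starved.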

14. System software stack
- Tile Processor hardware
- Hypervisor
- Supervisor: Tile Linux
- Applications / user
4 different modes for tiles
- Standard: SMP Tile Linux (2.6.38)
- Dataplane: Zero Overhead Linux
- Bare metal environment: user-created run-time environment
- Dedicated: tile for debugging

15. Bare metal environment
Bare Metal Environment
- Run-time environment that allows users to run applications that require direct access to the hardware
- Abilities
  - Full access to all hardware resources
  - Install interrupt vectors
  - Virtual/physical memory allocator
  - I/O device setup
  - UDN/IDN (can also communicate with SMP Linux)
  - Libc utilities that do not depend on OS system services

16. Power management
- Dynamic voltage and frequency scaling (DVS, DFS) are available
- Configurable I/O and accelerator shutdowns
- Hardware-initiated zero-latency tile sleep
- Software-initiated low-power tile NAP mode

17. Multicore Development Environment
- TILEmpower-Gx
- Development environment
- x86 host machine (bern.snu.ac.kr)
- MDE 4.1 / RPM
- Operating systems
- Multicore profiler/debugger
- Evaluation platforms
- KVM, IDE, gcc, and so on
$ tile-monitor -flags

18. Computational performance evaluation

19. Computational performance evaluation
Benchmark scenario
- Matrix multiplication with OpenMP
- C (1000 by 1000) = A (1000 by 1000) × B (1000 by 1000)
Performance [figure]
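The benchmark source is not shown in the slides. As a rough illustration of how such a multiplication is usually split across cores, the sketch below divides the rows of C among workers, which is the same decomposition an OpenMP parallel loop over rows would give. It uses Python threads as a stand-in for OpenMP, so it shows the work partitioning only and says nothing about TILE-Gx performance (CPython threads do not run this arithmetic in parallel). All function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_rows(a, b, rows):
    """Compute the requested rows of the product a x b."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in rows
    ]


def parallel_matmul(a, b, workers=4):
    """Row-partitioned matrix product: each worker computes a contiguous block of rows."""
    n = len(a)
    chunk = max(1, (n + workers - 1) // workers)   # rows per worker, rounded up
    blocks = [range(start, min(start + chunk, n)) for start in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves block order, so the rows come back in order
        parts = list(pool.map(lambda rows: matmul_rows(a, b, rows), blocks))
    return [row for part in parts for row in part]


# Small example; the slide's benchmark uses 1000 by 1000 matrices.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = parallel_matmul(a, b, workers=2)
```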

20. Memory performance evaluation

21. Memory performance for each core
Memory access cycles for each core on ZOL (Zero Overhead Linux)
Blue: load buffer0 in node0 / Green: load buffer1 in node1
[Figure: grid of tiles with Memory Node 0 (Buffer 0) and Memory Node 1 (Buffer 1); the faster row is marked. Legend: the number of cycles]

22. Memory performance for each core
Memory access cycles for each core on BME (Bare Metal Environment)
Blue: load buffer0 in node0 / Green: load buffer1 in node1
[Figure: grid of tiles with Memory Node 0 (Buffer 0) and Memory Node 1 (Buffer 1); the faster row is marked. Legend: the number of cycles]

23. Memory controller
Memory controller block diagram [figure]

24. Thank you