A Low-Area Interconnect Architecture for Chip Multiprocessors Zhiyi Yu and Bevan Baas VLSI Computation Lab ECE Department, UC Davis.

Outline
Motivation
–Why chip multiprocessors
–The difference between chip-multiprocessor interconnect and multiple-chip interconnect
Low-area asymmetric interconnect architecture
High-performance multiple-link architecture

Why Chip Multiprocessors
Chip multiprocessor challenges
–Traditional high-performance techniques are less practical (e.g., increased clock frequency)
–Power dissipation is a key constraint
–Global wires scale poorly in advanced technologies
Solution: chip multiprocessors
–Parallel computation for high performance
–Reduce the clock frequency and voltage when full-rate computation is not needed, for high energy efficiency
–Constrain wires to be no longer than the size of one core

Interconnect in Chip Multiprocessors vs. Interconnect in Multiple Chips
Chip multiprocessors have relatively limited area resources for interconnect circuitry
–Goal: reduce buffer area with little loss of interconnect capability
Chip multiprocessors have relatively abundant wire resources for inter-processor connection
–Goal: use more wire resources to increase interconnect capability

Outline
Motivation
Low-area asymmetric interconnect architecture
–Traditional dynamic network-on-chip
–Statically configured network-on-chip
–Proposed asymmetric architecture
High-performance multiple-link architecture

Background: Traditional Network on Chip (NoC) Using Dynamic Routing
–Advantage: flexible
–Disadvantage: high area and power cost
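To make the per-packet decision of a dynamic router concrete, here is a minimal sketch of dimension-ordered (XY) routing, a common dynamic NoC routing policy; the function name and coordinate scheme are illustrative assumptions, not taken from this presentation:

```python
def xy_route(cur, dst):
    """One routing decision of a dimension-ordered (XY) dynamic router.

    Route fully in X first, then in Y, then eject to the local core.
    Illustrative sketch only: real NoC routers add per-port buffering,
    arbitration, and flow control, which is where the area and power
    cost of dynamic routing comes from.
    """
    cx, cy = cur
    dx, dy = dst
    if cx != dx:
        return 'E' if dx > cx else 'W'
    if cy != dy:
        return 'N' if dy > cy else 'S'
    return 'CORE'

print(xy_route((1, 1), (3, 0)))  # 'E': resolve the X offset first
```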

Background: Static Nearest-Neighbor Interconnect Architecture
Low area cost
–One buffer per processor, not four
High latency for long-distance communication
–Data passes through each intermediate processor
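The long-distance latency penalty can be sketched with a simple hop-count model: under nearest-neighbor routing in a 2D mesh, each hop traverses one intermediate processor, so latency grows linearly with Manhattan distance. The function below is an illustrative assumption, not a model from the presentation:

```python
def nearest_neighbor_hops(src, dst):
    """Hop count for static nearest-neighbor routing in a 2D mesh.

    Coordinates are (x, y) tuples. Every hop passes through one
    intermediate processor's buffer, so latency is proportional to
    the Manhattan distance between source and destination.
    """
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

# Corner to corner in an 8x8 array: 14 hops, each through a core.
print(nearest_neighbor_hops((0, 0), (7, 7)))  # 14
```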

Asymmetric Data Traffic in Inter-processor Communication
Data traffic of routers in a 9-processor JPEG encoder: 80% of traffic goes to the processor core; 20% passes by the core.

Network data words at router input ports (relative):
–East 9%, North 26%, West 22%, South 43%
Network data words at router output ports (relative):
–Core 80%, East 8%, North 4%, West 8%, South 0%

Asymmetrically-Buffered Interconnect Architecture
–Large buffer only for the processing core
–Supports a flexible direct switch/route interconnect
(Figure: comparison of the asymmetric, dynamic packet, and static architectures.)

Outline
Motivation
Low-area asymmetric interconnect architecture
High-performance multiple-link architecture

Design Option Exploration
Choose a static routing architecture
–Much smaller area and power required
Assign two buffers (ports) to each processing core
–A natural fit with 2-input, 1-output instructions (e.g., C = A + B)
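The fit between two input ports and dual-operand instructions can be sketched as follows: both operands of a 2-input, 1-output operation are popped directly from the two network-port FIFOs. The port names and data values are hypothetical, for illustration only:

```python
from collections import deque

def add_from_ports(port_a, port_b):
    """C = A + B with both operands consumed directly from the two
    network-port FIFO buffers assigned to the core.

    Illustrative sketch: with exactly two input ports per core, a
    dual-operand instruction can source both operands from the
    network in a single step.
    """
    return port_a.popleft() + port_b.popleft()

# Hypothetical words arriving on the two statically configured links:
port_a = deque([3, 5])
port_b = deque([4, 6])
print(add_from_ports(port_a, port_b))  # 7
print(add_from_ports(port_a, port_b))  # 11
```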

Single Link vs. Multiple Links
Why consider multi-link architectures?
–A single link might be inefficient in some cases
–There are many link/wire resources on chip
Fully connected multi-link architectures have huge numbers of long wires
–Simplified alternative architectures are needed
(Figure: type 1 with one link per edge vs. a fully connected network with two links per edge.)

Architectures with Multiple Links
(Figure: types 2 and 3 with two links per edge, types 4 and 5 with three links per edge, types 6 and 7 with four links per edge; variants either dedicate one link to the neighbor link or make all links the same.)

Area and Speed of Seven Proposed Network Architectures
–All seven architectures were implemented in hardware
–The four-link architectures (types 6 and 7) require almost 25% more area
–Clock rates for all are within approximately 2%

Comparing Routing Latency Using Communication Models
–All architectures have the same routing latency for one→one, one→all, and all→all communication
–The architectures behave differently for all→one communication
(Table: routing latencies for all→one communication with n × n processors, for type 1, types 2/3, and types 4/5. Figure: example all→one linear communication for six processors.)
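Why all→one separates the architectures can be seen with a simple bottleneck model: in an all→one pattern on a linear array, the edge next to the sink must carry a word from every other processor, and k parallel links drain those words k at a time. This is a simplified model under stated assumptions, not the presentation's exact per-type formulas:

```python
import math

def all_to_one_latency(n, links_per_edge):
    """Bottleneck-based estimate of all->one latency on a linear array.

    The edge adjacent to the sink carries n - 1 words; with k parallel
    links it needs ceil((n - 1) / k) cycles to drain them. Simplified
    sketch only: it ignores per-hop propagation and any constant terms
    in the presentation's per-type latency formulas.
    """
    return math.ceil((n - 1) / links_per_edge)

# Six processors sending to one, as in the slide's linear example:
for k in (1, 2, 3):
    print(k, all_to_one_latency(6, k))
```

Even this crude model reproduces the trend the summary reports: going from one link per edge to two or three cuts the all→one drain time by roughly 2 to 3 times.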

Summary
The asymmetric buffer-allocation topology uses 1 or 2 large buffers instead of 4 large buffers to support long-distance communication
–Provides 2 to 4 times area savings with little performance reduction
Using 2 or 3 links per edge achieves good area/performance tradeoffs for chip multiprocessors containing simple single-issue processors
–Provides 2 to 3 times higher all→one communication capacity with 5%–10% increased area

Acknowledgements
Funding
–Intel Corporation
–UC Micro
–NSF Grant No.
–CAREER award
–SRC GRC Grant 1598
–IntellaSys Corporation
–S Machines
Special thanks
–Members of the VCL, R. Krishnamurthy, M. Anders, and S. Mathew
–ST Microelectronics and Artisan