CSE 661 PAPER PRESENTATION


CSE 661 PAPER PRESENTATION ON-CHIP INTERCONNECTION ARCHITECTURE OF THE TILE PROCESSOR By D. Wentzlaff et al. Presented by SALAMI, Hamza Onoruoiza (g201002240)

OUTLINE OF PRESENTATION Introduction Tile64 Architecture Interconnect Hardware Network Uses Network to Tile Interface Receive-side Hardware Demultiplexing Protection Shared Memory Communication and Ordering Interconnect Software Communication Interface Applications Conclusion

INTRODUCTION The Tile Processor’s five on-chip 2D mesh networks differ from the traditional bus-based scheme, which requires global broadcast and hence does not scale beyond 8 to 16 cores. A 1D ring is also not scalable, since its bisection bandwidth is constant. A mesh can support few or many processors with minimal changes to the network structure.

TILE64 ARCHITECTURE 2D grid of 64 identical compute elements (tiles) arranged in an 8 x 8 mesh. 1GHz clock, 3-way VLIW, 192 billion 32-bit instructions/sec. 4.8MB distributed cache, per-tile TLB. Supports DMA and virtual memory. Tiles may run independent OSs, or may be combined to run a multiprocessor OS such as SMP Linux. Shared memory: cores directly access other cores’ caches through the on-chip interconnects.

TILE64 ARCHITECTURE (2) Off-chip memory BW ≤ 200 Gbps; I/O BW ≥ 40 Gbps

TILE64 ARCHITECTURE (3) Courtesy: http://www.tilera.com/products/processors/TILE64

INTERCONNECT HARDWARE 5 low-latency mesh networks. Each network connects a tile in five directions: north, south, east, west, and the tile’s own processor. Each link is made of two 32-bit unidirectional links.

INTERCONNECT HARDWARE (2) 1.28 Tbps of bandwidth into and out of a single tile

NETWORK USES 4 dynamic networks: the packet header contains the destination’s (x, y) coordinate and the packet length (≤ 128 words); delivery is flow controlled and reliable. UDN: low-latency communication between userland processes without OS intervention. IDN: direct communication with I/O devices. MDN: communication with off-chip memory. TDN: direct tile-to-tile transfers; requests travel over the TDN, responses over the MDN. 1 static network (STN): streams of data instead of packets; a route is first set up, then streams are sent over it (circuit switched); also a userland network.
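As a rough sketch of the dynamic-network packet header: the slide states only that it carries the destination's (x, y) coordinate and a length of up to 128 words, so the field widths and encoding below are my assumptions, not the hardware's actual bit layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical header layout for a dynamic-network packet.  The paper
 * says only that the header carries the destination (x, y) coordinate
 * and the packet length (<= 128 words); field widths are illustrative. */
typedef struct {
    uint8_t dest_x;   /* destination column, 0..7 on the 8 x 8 mesh */
    uint8_t dest_y;   /* destination row,    0..7 on the 8 x 8 mesh */
    uint8_t length;   /* payload length in 32-bit words, <= 128     */
} dyn_header;

/* Pack the header into a single 32-bit word (illustrative encoding). */
static uint32_t pack_header(dyn_header h) {
    return ((uint32_t)h.dest_x << 24) | ((uint32_t)h.dest_y << 16) | h.length;
}

static dyn_header unpack_header(uint32_t w) {
    dyn_header h = { (uint8_t)(w >> 24), (uint8_t)((w >> 16) & 0xff),
                     (uint8_t)(w & 0xff) };
    return h;
}
```

A single header word like this is consistent with the overhead accounting later in the presentation (1 header word + 1 tag word per packet).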

LOGICAL VS. PHYSICAL NETWORKS 5 physically independent networks. Lots of free nearest-neighbor on-chip wiring. Buffer space takes about 60% of each network’s area, while each network takes only about 1.1% of tile area. A more reliable on-chip network means less buffering is needed to manage link failure.

NETWORK TO TILE INTERFACE Tiles have register-mapped access to the on-chip networks: instructions can read/write directly from/to the UDN, IDN, or STN. The MDN and TDN are used indirectly, e.g. on a cache miss.

RECEIVE-SIDE HARDWARE DEMULTIPLEXING Tag word = (sending node, stream number, message type). The receiving hardware demultiplexes each message into the appropriate queue using its tag. On a tag miss, the data is sent to a ‘catch-all’ queue and an interrupt is raised. The UDN has 4 demux queues plus one ‘catch-all’; the IDN has 2 demux queues plus one ‘catch-all’. 128 words of receive-side buffering per tile.
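The demultiplexing decision can be sketched as follows. The queue counts come from the slide; the tag encoding and the one-tag-per-queue match logic are simplified assumptions about what the receive-side hardware does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define UDN_DEMUX_QUEUES 4            /* UDN: 4 demux queues + 1 catch-all */
#define CATCH_ALL        UDN_DEMUX_QUEUES

/* Tag word = (sending node, stream number, message type). */
typedef struct { uint16_t sender; uint8_t stream; uint8_t type; } tag_word;

/* One configurable tag per demux queue, standing in for the real
 * receive-side match hardware. */
static tag_word queue_tag[UDN_DEMUX_QUEUES];
static bool     queue_valid[UDN_DEMUX_QUEUES];

static bool tag_eq(tag_word a, tag_word b) {
    return a.sender == b.sender && a.stream == b.stream && a.type == b.type;
}

/* Return the queue an incoming message is steered into.  On a tag miss
 * the message goes to the catch-all queue and, in hardware, an
 * interrupt is raised so software can handle it. */
static int demux(tag_word t, bool *interrupt) {
    for (int q = 0; q < UDN_DEMUX_QUEUES; q++) {
        if (queue_valid[q] && tag_eq(queue_tag[q], t)) {
            *interrupt = false;
            return q;
        }
    }
    *interrupt = true;     /* tag miss: catch-all queue + interrupt */
    return CATCH_ALL;
}
```

The catch-all path is what the iLib message-passing implementation later relies on to learn about unexpected messages.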

RECEIVE-SIDE HARDWARE DEMULTIPLEXING (2)

PROTECTION The Tile Architecture implements a Multicore Hardwall (MH). MH protects the UDN, IDN, and STN links; standard memory protection mechanisms are used for the MDN and TDN. MH blocks any attempt to send traffic over a hardwalled link and signals an interrupt to system software. Protection is implemented on outbound links.

SHARED MEMORY COMMUNICATION AND ORDERING On-chip distributed shared cache. Data can be retrieved from: the local cache; the home tile (request sent through the TDN), since shared data lives only at its home tile, where coherence is maintained; or main memory. There is no guaranteed ordering between the networks and shared memory; memory fence instructions are used to enforce ordering.
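The ordering problem can be illustrated with a portable C11 analogue. The Tile hardware has its own memory-fence instruction; `atomic_thread_fence` below merely stands in for it. A producer must fence between writing shared data and sending the notification (a flag here, a UDN message on the real chip), or the consumer may observe stale data.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Shared payload plus a flag standing in for a UDN notification. */
static int        payload;
static atomic_int ready;

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                /* write shared data      */
    atomic_thread_fence(memory_order_release);   /* fence: data is visible */
                                                 /* before the notification */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return 0;
}

static void *consumer(void *out) {
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;                                        /* spin on notification   */
    atomic_thread_fence(memory_order_acquire);   /* fence before reading   */
    *(int *)out = payload;
    return 0;
}
```

Without the two fences the relaxed flag operations would not order the payload write against the notification, which is exactly the hazard the slide describes between the networks and shared memory.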

INTERCONNECT SOFTWARE The C-based iLib library provides communication primitives implemented via the UDN: lightweight socket-like streaming channels for streaming algorithms, and an MPI-like message-passing interface for ad hoc messaging.

COMMUNICATION INTERFACES iLib sockets: long-lived FIFO point-to-point connections between two processes; good for producer-consumer relationships. Multiple senders to one receiver are also possible; good for forwarding results to a single node for aggregation. Raw channels: low overhead; use only as much space as is available in the hardware buffer. Buffered channels: higher overhead, but virtualization into memory is possible.

COMMUNICATION INTERFACES (2) Message-Passing API Similar to MPI: messages can be sent from any node to any other at any time; no need to establish connections first. Implementation: the sender sends a packet carrying the message key and size. The receiver’s catch-all queue interrupts its processor. If a receive for this key is already posted, the receiver sends a packet to the sender to begin the transfer. Otherwise it saves the notification; when ilib_msg_receive() is later called with the same key, it sends a packet that interrupts the sender to begin the transfer.
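The rendezvous described above can be sketched as a small state machine. This is a plain C simulation of the protocol, not iLib code: the slide names only ilib_msg_receive(), so the function and type names below are mine.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PENDING 8

/* A pending-send notification saved by the receiver's catch-all
 * interrupt handler when no matching receive has been posted yet. */
typedef struct { int key; int size; bool used; } pending_t;
static pending_t pending[MAX_PENDING];

/* Receiver side, on catch-all interrupt: a sender announced a message
 * (key, size).  If a matching receive were already posted, the receiver
 * would immediately tell the sender to start the transfer; this sketch
 * models only the "save the notification" path. */
static void on_send_notification(int key, int size) {
    for (int i = 0; i < MAX_PENDING; i++) {
        if (!pending[i].used) {
            pending[i] = (pending_t){ key, size, true };
            return;
        }
    }
}

/* Later, msg_receive(key) (standing in for ilib_msg_receive) checks the
 * saved notifications; on a match it would send a packet interrupting
 * the sender so the data transfer can begin.  Returns the announced
 * size, or -1 if nothing with this key is pending. */
static int msg_receive(int key) {
    for (int i = 0; i < MAX_PENDING; i++) {
        if (pending[i].used && pending[i].key == key) {
            pending[i].used = false;
            return pending[i].size;   /* trigger transfer from sender */
        }
    }
    return -1;                        /* a real receive would block here */
}
```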

COMMUNICATION INTERFACES (3)

COMMUNICATION INTERFACES (4) The UDN’s maximum bandwidth is 4 bytes/cycle. Raw channels reach a maximum of 3.93 bytes/cycle; the small loss comes from the header word and tag word. Buffered channels add the overhead of memory reads/writes; message passing adds the overhead of interrupting the receiving tile. A packet for both buffered channels and message passing = 1 header word + 1 tag word + 16 words of data.

COMMUNICATION INTERFACES (5) Packet for buffered channels and message passing = 1 header word + 1 tag word + 16 words of data
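From this packet format, a back-of-the-envelope bound on payload bandwidth for the buffered and message-passing paths follows: 16 data words out of every 18 words sent on a 4 bytes/cycle link. The arithmetic is mine; the presentation reports measured numbers, which are lower still because of the memory and interrupt overheads.

```c
#include <assert.h>

/* Estimate effective payload bandwidth for an 18-word packet
 * (1 header + 1 tag + 16 data words) on the 4-bytes/cycle UDN.
 * Illustrative arithmetic only: it ignores the per-message memory
 * read/write and interrupt costs the slides describe. */
static double effective_bw(int data_words, int overhead_words,
                           double link_bw_bytes_per_cycle) {
    return link_bw_bytes_per_cycle
           * (double)data_words / (double)(data_words + overhead_words);
}
```

With 2 overhead words per 16 data words, the ceiling works out to about 3.56 bytes/cycle, i.e. packetization alone costs roughly 11% of the raw link bandwidth.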

APPLICATIONS Corner Turn Reorganize a distributed array from one dimension to another; each core sends data to every other core. Important factors: the network used for data distribution (TDN using shared memory, or UDN using raw channels), and the network used for tile synchronization (STN or UDN).

APPLICATIONS (2) Raw channels with STN synchronization: best performance; raw channels have minimal overhead, and the STN ensures synchronization messages don’t interfere with data. Raw channels with UDN synchronization: the UDN carries both data and synchronization messages, so extra overhead data is needed to distinguish between the two. Shared memory: simpler to program, but each user data word incurs four extra words to manage the network and avoid deadlock.

APPLICATIONS (3) Dot Product Pairwise element multiplication, followed by addition of all the products; a 65,536-element dot product. The shared-memory version is the least scalable, with higher communication overhead. From 2 to 4 tiles the speedup is superlinear, because the dataset then fits completely into the tiles’ L2 caches.
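The computation itself is simple; here is a serial reference version. The paper's interest is in how the vectors and partial sums are distributed across tiles, which this sketch deliberately omits.

```c
#include <assert.h>
#include <stddef.h>

/* Serial reference dot product: pairwise multiplication followed by
 * summation of all the products.  On the Tile64, the input vectors
 * would be partitioned across tiles, each tile computing a partial
 * sum that is then combined over the interconnect. */
static long long dot_product(const int *a, const int *b, size_t n) {
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (long long)a[i] * b[i];
    return sum;
}
```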

CONCLUSION The Tile Processor uses an unconventional architecture to achieve high on-chip communication bandwidth. Effective use of that bandwidth is possible thanks to the synergy between the hardware architecture and the software APIs (iLib).