CSE 661 PAPER PRESENTATION


1 CSE 661 PAPER PRESENTATION
ON-CHIP INTERCONNECTION ARCHITECTURE OF THE TILE PROCESSOR
By D. Wentzlaff et al.
Presented by SALAMI, Hamza Onoruoiza g

2 OUTLINE OF PRESENTATION
Introduction
Tile64 Architecture
Interconnect Hardware
Network Uses
Network to Tile Interface
Receive-side Hardware Demultiplexing
Protection
Shared Memory Communication and Ordering
Interconnect Software
Communication Interface
Applications
Conclusion

3 INTRODUCTION
The Tile Processor’s five on-chip 2D mesh networks differ from traditional interconnects:
Bus-based schemes require global broadcast, hence do not scale beyond 8 to 16 cores.
1D rings do not scale either; their bisection BW is constant.
The mesh can support few or many processors with minimal changes to the network structure.
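The scaling argument can be made concrete by counting the links that are cut when a network is split into two halves. The following is a minimal sketch, assuming bidirectional links counted once each and an even-sided square mesh without wraparound; the function names are illustrative and do not come from the paper.

```python
def ring_bisection_links(num_tiles: int) -> int:
    """Links cut when a 1D ring is split into two halves: always 2."""
    if num_tiles < 3:
        raise ValueError("a ring needs at least 3 nodes")
    return 2


def mesh_bisection_links(side: int) -> int:
    """Links cut when a side x side 2D mesh is split down the middle."""
    if side < 2 or side % 2:
        raise ValueError("this sketch assumes an even side length >= 2")
    return side  # one link per row crosses the cut


# Columns: tile count, ring bisection links, mesh bisection links.
for side in (4, 8, 16):
    tiles = side * side
    print(tiles, ring_bisection_links(tiles), mesh_bisection_links(side))
```

The ring column stays at 2 however many tiles are added, while the mesh column grows with the side length of the grid.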

4 TILE64 ARCHITECTURE
2D grid of 64 identical compute elements (tiles) arranged in an 8 x 8 mesh.
1 GHz clock, 3-way VLIW, up to 192 billion 32-bit instructions per second.
4.8 MB distributed cache; per-tile TLB.
Supports DMA and virtual memory.
Tiles may run independent OSs, or may be combined to run a multiprocessor OS such as SMP Linux.
Shared memory: cores directly access other cores’ caches through the on-chip interconnects.

5 TILE64 ARCHITECTURE (2)
Off-chip memory BW ≤ 200 Gbps
I/O BW ≥ 40 Gbps

6 TILE64 ARCHITECTURE (3) Courtesy:

7 INTERCONNECT HARDWARE
5 low-latency mesh networks.
Each network connects a tile in five directions: north, south, east, west, and to the processor.
Each link is made of two 32-bit unidirectional links.

8 INTERCONNECT HARDWARE(2)
1.28Tb/s BW in and out of a single tile
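One way to arrive at the 1.28 Tb/s figure is to multiply out the link counts. This is a sketch of that arithmetic, assuming one 32-bit word per link per cycle at 1 GHz and counting only the four neighbor directions (not the processor port); that breakdown is an assumption made here, not a derivation given on the slide.

```python
NETWORKS = 5             # UDN, IDN, MDN, TDN, STN
NEIGHBOR_DIRECTIONS = 4  # north, south, east, west
LINK_WIDTH_BITS = 32
CLOCK_HZ = 1_000_000_000  # 1 GHz


def tile_bandwidth_bits_per_second() -> int:
    """Aggregate bandwidth in and out of one tile under the stated assumptions."""
    # Every direction has one inbound and one outbound link on each network.
    links = NETWORKS * NEIGHBOR_DIRECTIONS * 2
    return links * LINK_WIDTH_BITS * CLOCK_HZ


print(tile_bandwidth_bits_per_second() / 1e12, "Tb/s")
```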

9 NETWORK USES
4 dynamic networks
Packet header contains the destination’s (x, y) coordinates and the packet length (≤ 128 words).
Flow controlled, reliable delivery.
UDN: low-latency communication between userland processes without OS intervention.
IDN: direct communication with I/O devices.
MDN: communication with off-chip memory.
TDN: direct tile-to-tile transfers; requests go through the TDN, responses through the MDN.
1 static network (STN)
Streams of data instead of packets.
First set up a route, then send streams (circuit switched).
Also a userland network.
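The dynamic-network packet header described above carries a destination coordinate and a length. The sketch below packs those three fields into one 32-bit word. The bit layout (8 bits per field) and the function names are hypothetical, chosen only for illustration; the slide does not give the real encoding.

```python
from typing import Tuple

MESH_SIDE = 8           # TILE64 is an 8 x 8 mesh
MAX_PACKET_WORDS = 128  # length limit stated on the slide


def pack_header(dest_x: int, dest_y: int, length_words: int) -> int:
    """Pack destination and length into one word (hypothetical layout)."""
    if not (0 <= dest_x < MESH_SIDE and 0 <= dest_y < MESH_SIDE):
        raise ValueError("destination outside the 8 x 8 mesh")
    if not (1 <= length_words <= MAX_PACKET_WORDS):
        raise ValueError("length must be between 1 and 128 words")
    return (dest_x << 16) | (dest_y << 8) | length_words


def unpack_header(header: int) -> Tuple[int, int, int]:
    """Recover (dest_x, dest_y, length_words) from a packed header."""
    return (header >> 16) & 0xFF, (header >> 8) & 0xFF, header & 0xFF


header = pack_header(dest_x=3, dest_y=5, length_words=128)
print(unpack_header(header))
```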

10 LOGICAL VS. PHYSICAL NETWORKS
5 physically independent networks.
Lots of free nearest-neighbor on-chip wiring.
Buffer space takes about 60% tile area vs 1.1% for each network.
More reliable on-chip network => less buffering to manage link failure.

11 NETWORK TO TILE INTERFACE
Tiles have register-mapped access to the on-chip networks.
Instructions can read from and write to the UDN, IDN, or STN directly.
The MDN and TDN are used indirectly, on cache misses.

12 RECEIVE-SIDE HARDWARE DEMULTIPLEXING
Tag word = (sending node, stream num., message type).
Receiving hardware demultiplexes each message into the appropriate queue using the tag.
On a tag miss, the data is sent to the ‘catch all’ queue and an interrupt is raised.
UDN has 4 deMUX queues and one ‘catch all’ queue.
IDN has 2 deMUX queues and one ‘catch all’ queue.
128 words of receive-side buffering per tile.
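The demultiplexing rule can be sketched in software as a tag lookup with a fallback queue. This is a simplified model, not the hardware design: the class and method names are hypothetical, the interrupt is represented by a counter, and the shared 128-word buffer limit is not modeled.

```python
from collections import deque


class ReceiveDemux:
    """Software model of tag-based receive-side demultiplexing."""

    def __init__(self, num_queues: int):
        self.num_queues = num_queues
        self.queues = {}          # tag -> deque of payloads
        self.catch_all = deque()  # messages whose tag matched no queue
        self.interrupts = 0       # stands in for the raised interrupt

    def bind(self, tag):
        """Dedicate one demux queue to the given tag."""
        if len(self.queues) >= self.num_queues:
            raise RuntimeError("all demux queues are in use")
        self.queues[tag] = deque()

    def deliver(self, tag, payload):
        """Route an incoming message by tag; a tag miss goes to catch-all."""
        queue = self.queues.get(tag)
        if queue is None:
            self.catch_all.append((tag, payload))
            self.interrupts += 1
        else:
            queue.append(payload)


udn = ReceiveDemux(num_queues=4)  # UDN: 4 demux queues plus catch-all
udn.bind(("tile3", 0, "data"))
udn.deliver(("tile3", 0, "data"), [1, 2, 3])  # tag hit
udn.deliver(("tile9", 1, "ctrl"), [7])        # tag miss
```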

13 RECEIVE-SIDE HARDWARE DEMULTIPLEXING(2)

14 PROTECTION
The Tile Architecture implements Multicore Hardwall (MH).
MH protects UDN, IDN and STN links.
Standard memory protection mechanisms are used for the MDN and TDN.
MH blocks attempts to send traffic over a hardwalled link, then signals an interrupt to system software.
Protection is implemented on outbound links.
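The hardwall behavior can be pictured as a check on every outbound send. The sketch below is a software analogy only: the class names are hypothetical, and a Python exception stands in for the interrupt delivered to system software.

```python
PROTECTED_NETWORKS = {"UDN", "IDN", "STN"}  # networks covered by the hardwall


class HardwallInterrupt(Exception):
    """Stands in for the interrupt signaled to system software."""


class TileOutboundLinks:
    """Outbound links of one tile, some of which may be hardwalled."""

    def __init__(self):
        self.hardwalled = set()  # (network, direction) pairs that are blocked

    def protect(self, network: str, direction: str) -> None:
        if network not in PROTECTED_NETWORKS:
            raise ValueError(f"{network} is not covered by the hardwall")
        self.hardwalled.add((network, direction))

    def send(self, network: str, direction: str, word: int):
        # The check happens on the outbound side, before traffic leaves the tile.
        if (network, direction) in self.hardwalled:
            raise HardwallInterrupt(f"blocked {network} traffic going {direction}")
        return (network, direction, word)


links = TileOutboundLinks()
links.protect("UDN", "east")
links.send("UDN", "north", 42)  # allowed: this link is not hardwalled
```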

15 SHARED MEMORY COMMUNICATION AND ORDERING
On-chip distributed shared cache.
Data could be retrieved from:
Local cache.
Home tile (request sent through TDN). Data available only in home tile. Coherency maintained here.
Main memory.
No guaranteed ordering between networks and shared memory.
Memory fence instructions are used to enforce ordering.
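The three places data can come from suggest a simple lookup order. The sketch below models only that order, using plain dictionaries as stand-ins for the caches and memory; cache fills, coherency actions and the TDN/MDN traffic are deliberately left out, and the function name is hypothetical.

```python
def locate(address, local_cache, home_cache, main_memory):
    """Return (source, value) for an address, checking the nearest level first."""
    if address in local_cache:
        return "local cache", local_cache[address]
    if address in home_cache:
        # In hardware this request would travel to the home tile over the TDN.
        return "home tile", home_cache[address]
    return "main memory", main_memory[address]


local = {0x10: "a"}
home = {0x20: "b"}
memory = {0x10: "a", 0x20: "b", 0x30: "c"}
for addr in (0x10, 0x20, 0x30):
    print(hex(addr), locate(addr, local, home, memory))
```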

16 INTERCONNECT SOFTWARE
C-based iLib provides communication primitives implemented via the UDN:
Lightweight socket-like streaming channels for streaming algorithms.
MPI-like message passing interface for ad hoc messaging.

17 COMMUNICATION INTERFACES
iLib Socket
Long-lived FIFO point-to-point connection between two processes.
Good for producer-consumer relationships.
Multiple senders to one receiver is possible; good for forwarding results to a single node for aggregation.
Raw Channels: low overhead; use as much space as is available in the buffer.
Buffered Channels: higher overhead, but virtualization of memory is possible.

18 COMMUNICATION INTERFACES(2)
Message Passing API
Similar to MPI.
Messages can be sent from a node to any other node at any time.
No need to establish connections.
Implementation
Sender: sends a packet with the message key and size.
Receiver’s catch-all queue interrupts the processor.
If the receiver is expecting a message with this key, it sends a packet to the sender to begin the transfer.
Else, it saves the notification. On ilib_msg_receive() with the same key, it sends a packet to interrupt the sender to begin the transfer.
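The receiver side of this handshake can be sketched as a small state machine. This is an illustrative Python model rather than iLib code: only ilib_msg_receive() is named on the slide, and the class, its methods and the `go_ahead` list are hypothetical stand-ins for the packets sent back to the sender.

```python
class MessageReceiver:
    """Models the receiver's side of the key-matching handshake."""

    def __init__(self):
        self.expected = set()  # keys with a receive call already waiting
        self.saved = {}        # key -> (sender, size) for early notifications
        self.go_ahead = []     # (sender, key) packets telling a sender to start

    def on_notification(self, sender, key, size):
        """Runs when the catch-all queue interrupts the processor."""
        if key in self.expected:
            self.expected.discard(key)
            self.go_ahead.append((sender, key))
        else:
            self.saved[key] = (sender, size)

    def msg_receive(self, key):
        """Models a call to ilib_msg_receive() with the given key."""
        if key in self.saved:
            sender, _size = self.saved.pop(key)
            self.go_ahead.append((sender, key))
        else:
            self.expected.add(key)


rx = MessageReceiver()
rx.on_notification(sender="tile5", key=7, size=64)  # arrives before the receive
rx.msg_receive(7)                                   # matches the saved notification
print(rx.go_ahead)
```

Either arrival order ends with exactly one go-ahead packet for the matching key, which is what lets any node send to any other without first setting up a connection.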

19 COMMUNICATION INTERFACES(3)

20 COMMUNICATION INTERFACES(4)
UDN’s maximum BW is 4 bytes/cycle.
Raw Channels’ max BW is 3.93 bytes/cycle; the overhead is due to the header word and tag word.
Buffered Channel: overhead of memory read/write.
Message Passing: overhead of interrupting the receiving tile.
Packet for Buffered and Message Passing = 1 header word + 1 tag word + 16 words of data.
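The packet format above puts a ceiling on payload bandwidth before any memory or interrupt costs are counted. The sketch below computes that ceiling from framing alone, assuming one 4-byte word moves per cycle; it is a bound derived here for illustration, not a measured result from the paper.

```python
UDN_BYTES_PER_CYCLE = 4  # one 32-bit word per cycle


def payload_bandwidth(data_words: int, overhead_words: int = 2) -> float:
    """Payload bytes per cycle when each packet carries header and tag words."""
    if data_words < 1:
        raise ValueError("a packet needs at least one data word")
    return UDN_BYTES_PER_CYCLE * data_words / (data_words + overhead_words)


# 16 data words + 1 header word + 1 tag word per packet.
print(round(payload_bandwidth(16), 2))
```

Longer packets amortize the two overhead words better, which is consistent with raw channels getting closer to the 4 bytes/cycle limit.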

21 COMMUNICATION INTERFACES(5)

22 APPLICATIONS
Corner Turn
Reorganize a distributed array from one dimension to another.
Each core sends data to every other core.
Important factors:
Network for distribution (TDN using shared memory, or UDN using raw channels).
Network for the tiles’ synchronization (STN or UDN).
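The all-to-all data movement of a corner turn can be sketched on ordinary Python lists. In this model each "core" starts with a block of matrix rows and ends with a block of columns; the per-core inboxes stand in for network transfers, and the function name is hypothetical. It assumes the column count divides evenly among the cores.

```python
def corner_turn(rows_per_core):
    """Redistribute a row-partitioned matrix so each core owns whole columns."""
    num_cores = len(rows_per_core)
    num_cols = len(rows_per_core[0][0])
    if num_cols % num_cores:
        raise ValueError("columns must divide evenly among cores")
    cols_per_core = num_cols // num_cores

    # Every source core sends one block to every destination core.
    inbox = [[] for _ in range(num_cores)]
    for rows in rows_per_core:
        for dst in range(num_cores):
            lo = dst * cols_per_core
            inbox[dst].append([row[lo:lo + cols_per_core] for row in rows])

    # Each destination stacks the received blocks and reads them column-wise.
    result = []
    for blocks in inbox:
        stacked = [row for block in blocks for row in block]
        result.append([list(col) for col in zip(*stacked)])
    return result


before = [
    [[0, 1, 2, 3], [4, 5, 6, 7]],        # core 0 owns rows 0 and 1
    [[8, 9, 10, 11], [12, 13, 14, 15]],  # core 1 owns rows 2 and 3
]
print(corner_turn(before))
```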

23 APPLICATIONS (2)
Raw Channel, STN synch: best performance. Raw channels have minimum overhead; the STN ensures synch messages don’t interfere with data.
Raw Channel, UDN synch: the UDN is used for both data and synch messages. Extra overhead data is needed to distinguish between the two kinds of message.
Shared Memory: simpler to program. Each user data word incurs four extra words to manage the network and avoid deadlock.

24 APPLICATIONS (3) Dot Product
Pairwise element multiplication, followed by addition of all products.
65,536-element dot product.
Shared memory is not scalable; it has higher communication overhead.
From 2 to 4 tiles, speedup is superlinear because the dataset then fits completely into the tiles’ L2 caches.
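The way a dot product divides across tiles can be sketched as a partition step followed by a reduction. This sequential Python model only shows the data split; the function names are hypothetical, and the list of partial sums stands in for the per-tile results that would be sent to one tile for the final addition.

```python
def partial_dot(a, b):
    """Work done by one tile: multiply pairwise, then add the products."""
    return sum(x * y for x, y in zip(a, b))


def tiled_dot(a, b, num_tiles):
    """Split the vectors into contiguous chunks, one per tile, then reduce."""
    if len(a) != len(b) or not a or num_tiles < 1:
        raise ValueError("need equal-length non-empty vectors and >= 1 tile")
    chunk = -(-len(a) // num_tiles)  # ceiling division
    partials = [
        partial_dot(a[i:i + chunk], b[i:i + chunk])
        for i in range(0, len(a), chunk)
    ]
    return sum(partials)  # final reduction on a single tile


a = list(range(65536))
b = [1] * 65536
print(tiled_dot(a, b, num_tiles=4))
```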

25 CONCLUSION
Tile uses an unconventional architecture to achieve high on-chip communication BW.
Effective use of BW is possible due to synergy between the hardware architecture and the software APIs (iLib).

