
1 System-on-Chip Design Methodologies for the Back-End Design Step
Front-end design step: system-level modelling and simulation tools, with models at decreasing abstraction level (Algorithmic, UnTimed Functional (UTF), Timed Functional (TF), Bus Cycle Accurate (BCA), Cycle Accurate (CA), Register Transfer Level (RTL)), trading simulation speed against simulation accuracy and synthesizability.
Back-end synthesis flow: create floorplan, create power grid, placement, clock tree synthesis (CTS), routing, post-routing optimization.

2 What do we have to design?
Recap of the modelling ladder (Algorithmic, UnTimed Functional (UTF), Timed Functional (TF), Bus Cycle Accurate (BCA), Cycle Accurate (CA), Register Transfer Level (RTL) models, trading abstraction level and simulation speed against simulation accuracy and synthesizability) and of the back-end synthesis flow (create floorplan, create power grid, placement, clock tree synthesis (CTS), routing, post-routing optimization).

3 Let us start from the communication backbone for the system as a whole!

4 State-of-the-Art System-Level Interconnection Protocols and Topologies

5 The key role of on-chip communication
In highly parallel computing architectures, the communication architecture is key to materializing the expected computation performance and to meeting the total power budget. It is both an ARCHITECTURAL CHALLENGE and a PHYSICAL CHALLENGE (the hell of nano-scale physics).

6 The communication bottleneck
Historically: shared bus topology, aimed at simple, cost-effective integration of components. Typical example: ARM Ltd. AMBA AHB.
(Diagram: masters 1-3 and slaves 1-5 attached to a single shared BUS.)
Problems: SERIALIZATION of bus access requests; a SINGLE outstanding transaction. If wait states are needed for a memory access, everybody waits.

7 Bus evolution
Protocol (the rules of the road): toward improved utilization of the topology (throughput, latency).
Topology (the street map): toward enhanced parallelism (more communication flows at the same time).

8 An «advanced» starting point
The ADVANCED MICROCONTROLLER BUS ARCHITECTURE (AMBA) is an open-standard, on-chip interconnect specification from ARM for the connection and management of functional blocks in SoC design.
Version 1 (1996): Advanced Peripheral Bus (APB)
Version 2: Advanced High-performance Bus (AHB)
Version 3 (2003): Advanced eXtensible Interface (AXI)
Version 4 (2010): AXI4; (2011): AXI Coherency Extensions (ACE)
Version 5 (2013): Coherent Hub Interface (CHI)
These protocols are today the de facto standard for 32-bit embedded processors because they are well documented and can be used without any royalties.

9 A SoC today… System interconnect

10 Bus Components: terminology
Initiator: the FU that initiates transactions.
Target: the FU that responds to incoming transactions.
Master/Slave: the initiator/target side of the bus interface.
Bridge: connects two buses; it acts as an initiator on one side and a target on the other.
(Diagram: bus actors, a microprocessor core (initiator) with a master interface and a memory (target) with a slave interface on the BUS.)

11 AMBA BUS OBJECTIVES
Facilitate the right-first-time development of multi-master and multi-slave SoCs.
Be technology-independent and allow IP design reuse.
Encourage modular system design.
Minimize the silicon infrastructure needed for on-chip communication.

12 AMBA BUS(SES)
(Diagram: ARM processor, high-bandwidth on-chip RAM, DMA master, and a high-bandwidth external memory interface on the AHB; a BRIDGE connects to the APB hosting UART, timer, keypad, and PIO.)
AHB: Advanced High-performance Bus. High performance, pipelined operation, multiple bus masters, burst transfers, split transactions.
APB: Advanced Peripheral Bus. Low power, latched address and control, simple interface (1 master), suitable for many peripherals.

13 AMBA AHB Bus architecture - datapath
HADDR = address bus; HWDATA = write data bus; HRDATA = read data bus.
Mux-based interconnect scheme: masters generate data AND control signals; a central arbiter determines which signals are propagated all the way to the slaves.
Broadcast paradigm for slave connectivity: a decoder selects which slave is involved in the transaction, and which response signals to propagate back to the masters.
Different wires per direction, hence no bidirectional wires.

14 Address decoding
Centralized decoder: simple (high-speed) decoding of the HADDR MSBs.
(Diagram: masters #1 and #2 drive HADDR_Mx[31:0] into the address-and-control mux; the shared address bus fans out to all slaves; the decoder drives dedicated HSEL_S1..HSEL_S3 select wires.)
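To make the MSB decoding concrete, here is a minimal Python sketch (the address map, region size, and slave count are invented for illustration): the decoder looks only at the top bits of HADDR to assert exactly one HSEL line.

```python
# Minimal sketch of centralized AHB-style address decoding (illustrative only).
# Each slave owns one 256 MB region, so the decoder needs just HADDR[31:28].

NUM_SLAVES = 3

def decode(haddr: int) -> list[bool]:
    """Return the HSEL vector: a one-hot select derived from the address MSBs."""
    region = (haddr >> 28) & 0xF          # high-speed: a 4-bit compare, no adder
    return [region == i for i in range(NUM_SLAVES)]

assert decode(0x0000_1000) == [True, False, False]   # slave #1 region
assert decode(0x1FFF_FFFC) == [False, True, False]   # slave #2 region
```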

15 AMBA basic transfer
Each transfer has an address phase and a data phase. For a write, the master drives HWDATA; for a read, HRDATA is driven by the slave. Because of the 2-stage pipeline, pipelining in AHB is a means of anticipating the address of the next transaction, not of supporting multiple outstanding transactions!

16 Transfer with WAIT states
The slave stretches the data phase (e.g., 2 wait cycles) before the next address can complete. During wait states, all communication actors "wait"! ...which is always a bad thing!

17 Bus Arbitration
Buses can support multiple initiators. A protocol is needed for allocating bus resources and resolving conflicts: the bus arbitration protocol. A decision procedure is needed to choose among requesters: the arbitration policy.
NOTE: do not confuse the arbitration protocol with the arbitration policy.

18 Arbitration Protocol
AHB uses a simple request-grant protocol.
(Diagram: masters #1-#3 raise dedicated HBREQ_Mx wires to the ARBITER, which returns HGRANT_Mx and drives HMASTER[3:0] to indicate the granted master; the granted master's HADDR_Mx[31:0] is muxed onto the shared address bus to all slaves.)
The arbitration policy is NOT specified by AMBA (it is a degree of freedom for the user: round-robin, fixed-priority, ...), while the protocol IS!
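To make the protocol/policy split concrete, here is a minimal Python sketch: the request/grant interface stays fixed while the policy function is swapped. Both policy functions are illustrative, not part of the AMBA specification.

```python
# The request/grant protocol is fixed (requests in, one grant out);
# the arbitration policy is a pluggable function.

def fixed_priority(requests: list[bool], last: int) -> int | None:
    """Grant the lowest-numbered requesting master (ignores history)."""
    for m, req in enumerate(requests):
        if req:
            return m
    return None

def round_robin(requests: list[bool], last: int) -> int | None:
    """Grant the first requester after the previously granted master."""
    n = len(requests)
    for off in range(1, n + 1):
        m = (last + off) % n
        if requests[m]:
            return m
    return None

# Same request pattern and history, different grant:
reqs = [True, False, True]              # masters 0 and 2 assert HBREQ
assert fixed_priority(reqs, last=0) == 0
assert round_robin(reqs, last=0) == 2
```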

19 Granting bus access
HREADY dictates transaction timing: a master waiting for access gets the bus only once the previous data phase (transfer) has completed; its bus access time then spans an address phase and a data phase.

20 Burst transfers
Arbitration has a significant cost; burst transfers amortize it by granting bus control for a number of data transfers (not for a single one). They help with DMA and block transfers, but require safeguards against starvation. Starvation is when a ready-to-go bus transaction is blocked indefinitely because it cannot acquire the needed resources.

21 AMBA AHB Burst types
SINGLE: single transfer.
INCR: incrementing burst of unspecified length.
INCRx: incrementing burst of x words (beats).
WRAPx: wrapping burst of x words (beats).
Wrapping serves transactions not aligned to word_size*beats: on a cache miss on word B of line A-B-C-D, the L2 returns B, C, D and then wraps around to A, so the CPU can RESTART as soon as the critical word B arrives. A sketch of the generated addresses follows.
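A minimal sketch of the beat addresses generated by incrementing vs. wrapping bursts (word size and beat count chosen for illustration): the WRAP4 sequence starting at the missed word comes back around to the start of the aligned block, which is exactly the critical-word-first refill pattern above.

```python
# Beat addresses for AHB-style INCR4 vs WRAP4 bursts (4-byte words, 4 beats).

def incr_burst(start: int, beats: int = 4, size: int = 4) -> list[int]:
    """Incrementing burst: addresses simply advance by the word size."""
    return [start + i * size for i in range(beats)]

def wrap_burst(start: int, beats: int = 4, size: int = 4) -> list[int]:
    """Wrapping burst: addresses wrap at the beats*size-aligned boundary."""
    block = beats * size                  # wrap boundary: beats * word size
    base = start & ~(block - 1)           # aligned start of the block
    return [base + ((start - base + i * size) % block) for i in range(beats)]

# Cache line A,B,C,D at 0x20,0x24,0x28,0x2C; the miss is on word B (0x24):
assert incr_burst(0x24) == [0x24, 0x28, 0x2C, 0x30]   # runs past the line
assert wrap_burst(0x24) == [0x24, 0x28, 0x2C, 0x20]   # fetches B,C,D, wraps to A
```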

22 4-beat incrementing burst

23 Handover after burst
The arbiter changes the HGRANTx signals when the penultimate (one-before-last) address has been sampled. The new HGRANTx information is then sampled at the same point as the last address of the burst, so the handover incurs no overhead.

24 Slave responses
Once initiated, a transfer cannot be suspended or cancelled by the master: the slave determines how the transfer should progress. The slave reports the status of the transfer through the HREADY and HRESP[1:0] signals. Advantage: if the slave needs fewer than x (say, 16) wait cycles, it keeps the bus busy; otherwise the bus is released and re-acquired later on.

25 Slave responses
OKAY (default): when the master samples HREADY high and HRESP = OKAY, the transfer is done!
ERROR: e.g., memory protection, such as a write access to a read-only memory location.
RETRY: the master should retry until the transfer completes... but may be un-granted meanwhile; prevents starvation of high-priority masters.
SPLIT: the master should retry the transfer when it is next granted; prevents starvation of all masters.
ERROR, RETRY and SPLIT are 2-cycle responses due to the pipelined nature of the bus.

26 Slave Retry response
The slave samples the first address A, then signals a retry response. RETRY is denoted by 1) HREADY low AND 2) HRESP = RETRY. An idle cycle follows, allowing state rollback in the bus and master FSMs.

27 Split and retry
Used when the slave cannot complete the transfer right away. The bus may be re-arbitrated right after the split/retry procedure:
RETRY: only higher-priority masters can access the bus; otherwise the transfer is retried right away!
SPLIT: the master is excluded from arbitration, and any other master can access the bus. The arbiter must know when the slave is ready to terminate the transaction with the pre-empted master, which is then readmitted to arbitration.
From the MASTER's viewpoint, nothing changes: it keeps requesting the bus to complete the transfer.

28 Recovery from SPLIT
Recovery is initiated by the slave! When the slave is ready to complete the transfer, it notifies the arbiter, via the HSPLIT[..] bus, which master should be re-granted access. When any bit of HSPLITx is asserted, the arbiter restores the priority of the appropriate master. Eventually the arbiter will grant that master so it can re-attempt the transfer; this may not occur immediately if a higher-priority master is using the bus.
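A sketch of the arbiter-side bookkeeping for SPLIT, under simplified assumptions (fixed-priority grant, one mask bit per master): a split master is masked out of arbitration until the slave raises the corresponding HSPLITx bit.

```python
# Illustrative arbiter bookkeeping for AHB SPLIT (not a full AMBA model).

class SplitArbiter:
    def __init__(self, num_masters: int):
        self.split_mask = [False] * num_masters  # True = excluded from arbitration

    def on_split_response(self, master: int):
        """Slave answered SPLIT: exclude this master from arbitration."""
        self.split_mask[master] = True

    def on_hsplit(self, hsplit: list[bool]):
        """Slave asserts HSPLITx bits: readmit those masters to arbitration."""
        for m, ready in enumerate(hsplit):
            if ready:
                self.split_mask[m] = False

    def grant(self, requests: list[bool]) -> int | None:
        """Fixed-priority grant among non-split requesters."""
        for m, req in enumerate(requests):
            if req and not self.split_mask[m]:
                return m
        return None

arb = SplitArbiter(2)
arb.on_split_response(0)
assert arb.grant([True, True]) == 1   # master 0 keeps requesting, but is masked
arb.on_hsplit([True, False])          # slave is ready for master 0 again
assert arb.grant([True, True]) == 0
```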

29 AHB: critical overview
Protocol: lacks parallelism.
In-order completion only; no multiple outstanding transactions, so slave wait states cannot be hidden effectively (split transactions are an afterthought for handling this).
High arbitration overhead (min. 2 cycles on single transfers).
Bus-centric (not transaction-centric) architecture: initiators and targets are directly exposed to bus-architecture internals (e.g., the arbiter); no decoupling, instance-specific bus components.
Topology: the scalability limitation of shared-bus solutions!

30 Bus evolution
Topology (the street map): toward enhanced parallelism.

31 Topology evolution
From a shared bus with unidirectional request and response lanes to a crossbar with unidirectional request and response lanes.

32 Topology evolution
Partial crossbar with unidirectional request and response lanes: a multi-layer bus architecture.
(Diagram: processor/traffic-generator clusters P0-P9, T0-T4 with local memories M0-M9 sit on shared-bus layers; the layers connect through a crossbar (xbar) to shared slaves S0-S4 and to M6 on a shared bus.)

33 But what is there on the market?

34 AMBA Multi-layer AHB
Enables parallel access paths between multiple masters and slaves.
Fully compatible with AHB wrappers.
It is a topology (not protocol) evolution.
Pure combinational matrix (scales poorly with the number of I/Os).
(Diagram: masters 1 and 2 reach several slaves through the AHB interconnect matrix.)

35 Multi-Layer AHB implementation
The matrix is completely flexible and can be adapted. The MUXes are the (per-slave) arbitration stages. An AHB layer can be AHB-Lite: single master, no request/grant, no split/retry. A layer is stalled through the backward HREADY signal being driven low.

36 Hierarchical systems Slaves accessed only by masters on a given layer can be made local to the layer

37 Multiple slaves
Multiple slaves can appear as a single slave to the matrix: combine low-bandwidth slaves, or group slaves accessed only by one master (e.g., a DMA controller). Alternatively, a slave can be an AHB-to-APB bridge, thus allowing connection to multiple low-bandwidth peripherals.

38 Multiple masters per layer
Combine masters that have low bandwidth requirements

39 Putting it all together…
The interconnect matrix is used for across-layer communication; HW semaphores guard shared resources.

40 Dual port slaves
Common for off-chip SDRAM controllers. Layer 1: bandwidth-limited, high-priority traffic with low latency requirements (e.g., processor cores). Layer 2: bandwidth-critical traffic (e.g., hardware accelerators). The dual-port slave may even be connected to the matrix.

41 Bus evolution
Protocol (the rules of the road): toward improved utilization of the topology (throughput, latency).

42 Protocol Evolutions: Split transactions
A split-transaction bus is a bus where the request and response phases are split and independent, to improve bus utilization. The master must arbitrate for the request phase; the slave must arbitrate for the response phase.
(Diagram: the bus is busy during the request, released while the slave prepares the response, and busy again for the response.)

43 Multiple outstanding transactions
(Diagram: the master keeps a queue of pending responses, the slave a queue of pending requests; several requests are in flight at once.)
The master needs to associate each response with one of its pending requests. The initiator should support multiple outstanding transactions too. A bookkeeping sketch of the master side follows.
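A minimal sketch of that master-side bookkeeping (the tag scheme is invented for illustration; in AXI this role is played by transaction IDs): each request is recorded under a tag, and a response is matched back to the pending entry with the same tag, in whatever order responses arrive.

```python
# Master-side matching of responses to outstanding requests via tags.

class OutstandingTable:
    def __init__(self):
        self.pending: dict[int, str] = {}   # tag -> pending request description
        self.next_tag = 0

    def issue(self, desc: str) -> int:
        """Record a new outstanding request and return its tag."""
        tag = self.next_tag
        self.next_tag += 1
        self.pending[tag] = desc
        return tag

    def complete(self, tag: int) -> str:
        """Match a response (arriving in any order) to its pending request."""
        return self.pending.pop(tag)

tbl = OutstandingTable()
t1 = tbl.issue("read @ slow slave S1")
t2 = tbl.issue("read @ fast slave S2")
assert tbl.complete(t2) == "read @ fast slave S2"   # fast slave answers first
assert tbl.complete(t1) == "read @ slow slave S1"   # slow slave answers later
```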

44 Out-of-order completion
(Diagram: the master issues a request to slow slave S1, then one to fast slave S2; the response from S2 comes back before the response from S1.)
The association between requests and responses becomes more challenging. The typical case of out-of-order completion is a fast slave addressed after a slow slave: the fast slave returns its response earlier.

45 Out-of-order completion
Out-of-order completion can occur even when multiple outstanding transactions are addressed to the same complex slave: such a slave may apply local optimizations and change the processing order of incoming requests (e.g., serving accesses to an open row first in an SDRAM device), so the response to request S12 may precede the response to S11. A scheduling sketch follows.
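As an illustration of such slave-side reordering (row size and policy invented, loosely modeled on open-row-first SDRAM scheduling): requests hitting the currently open row are served before the others.

```python
# Illustrative open-row-first scheduling inside a complex SDRAM-like slave.

ROW_BITS = 10                         # addresses in the same 1 KB row "hit"

def open_row_first(queue: list[int], open_row: int | None) -> list[int]:
    """Serve row hits first; within each group, keep arrival order."""
    hits = [a for a in queue if (a >> ROW_BITS) == open_row]
    misses = [a for a in queue if (a >> ROW_BITS) != open_row]
    return hits + misses

# A request to the open row 0 arrives last but is served first:
assert open_row_first([0x400, 0x800, 0x010], open_row=0) == [0x010, 0x400, 0x800]
```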

46 Bus-centric architecture
Internal bus components are directly exposed to the connected master and slave interfaces. The bus architecture is instance-specific and lacks modularity.

47 Transaction-centric bus architecture
Internal bus components are hidden behind the bus interfaces (a point-to-point communication protocol between master and slave interfaces): a modular architecture in which the internal bus architecture can freely evolve without impacting the interfaces. The only objective of the interfaces is specifying communication transactions (communication abstraction).

48 But what is there on the market?

49 AMBA 3.0 (AMBA AXI)
An evolution of the AHB communication protocol: high-bandwidth, low-latency designs; high-frequency operation; flexibility in the implementation; backward compatible with AHB and APB.
Novel features with respect to AHB: burst-based transactions with only the first address issued; address information can be issued before/after the actual write data transfer; multiple outstanding addresses; out-of-order transaction completion; easy addition of register stages for timing closure.

50 Design paradigm change
From masters and slaves attached to a specific bus, to initiators and targets speaking AXI to a communication architecture. AXI is a point-to-point interface specification, independent of the implementation of the communication architecture: the communication architecture can freely evolve and be customized. The interface is specified in terms of transactions. The Open Core Protocol (OCP) is another example of this paradigm.

51 Transaction-centric bus
AXI can be used to interconnect: an initiator to the bus; a target to the bus; an initiator directly with a target. The interface definition allows a variety of different interconnection-scheme implementations.

52 Channel-based Architecture
Five groups of signals (channels):
Read Address: "AR" signal-name prefix
Read Data: "R" signal-name prefix
Write Address: "AW" signal-name prefix
Write Data: "W" signal-name prefix
Write Response: "B" signal-name prefix
Channels are independent and asynchronous with respect to each other.

53 Interconnect approaches
(Diagram: masters and slaves joined by an address lane, a write-response lane, write-data lanes, and read-data lanes.)
Most systems use one of three interconnect approaches: shared address and data buses; shared address bus and multiple data buses; multi-layer, with multiple address and data buses (the most common).


55 Read transaction Single address for burst transfers

56 Write transaction Single response for an entire burst

57 Channels - One way flow
A channel is a set of unidirectional information signals with a VALID/READY handshake; READY is the only return signal. VALID: the source interface has valid data/control signals. READY: the destination interface is ready to accept data. LAST: indicates the last word of a burst transaction.
(Signal listing per channel: AW: AWVALID, AWREADY, AWADDR, AWLEN, AWSIZE, AWBURST, AWLOCK, AWCACHE, AWPROT, AWID. W: WVALID, WREADY, WDATA, WSTRB, WLAST, WID. R: RVALID, RREADY, RDATA, RRESP, RLAST, RID. B: BVALID, BREADY, BRESP, BID.)

58 Valid-ready handshake flexibility
The handshake permits different READY timings: proactive READY (asserted before VALID), asynchronous READY, and synchronous READY.

59 AMBA 2.0 AHB Burst
In an AHB burst, address and data are locked together; there are two pipeline stages, and HREADY controls pipeline operation.

60 AXI - One Address for Burst
In an AXI burst, one address is issued for the entire burst. (Timing diagram: data beats D11-D14, D21-D23, D31 each follow a single address.)

61 AXI - Outstanding Transactions
One address per burst, and multiple outstanding addresses are allowed. (Timing diagram: addresses A11, A21, A31 are issued while data beats D11-D14, D21-D23, D31 are still being returned.)

62 Problem: Slow slave
(Timing diagram: addresses A11, A21, A31 issued, but only D11, D12 returned.) If one slave is very slow, all data is held up.

63 Out-of-Order Completion
(Timing diagram: the burst for A21, beats D21-D23, returns before the burst for A11, beats D11-D14.)
Fast slaves may return data ahead of slow slaves; complex slaves may serve requests out of order. Each transaction carries an ID, assigned by the master interface; the channels have ID signals (AWID, ARID, RID, BID, ...). Transactions with the same ID must complete in order. In a multi-master system, the interconnect must append a further tag to the ID to make each master's IDs unique, as sketched below.
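A sketch of that ID-extension rule (field width invented): the interconnect prepends the master number to the transaction ID on the way in and strips it on the way back, so two masters may reuse the same local ID without collision.

```python
# Interconnect-side ID extension: make per-master IDs globally unique.

ID_BITS = 4                                # width of a master's own ID field

def extend_id(master: int, local_id: int) -> int:
    """Prepend the master number above the master's local ID bits."""
    return (master << ID_BITS) | local_id

def split_id(global_id: int) -> tuple[int, int]:
    """Recover (master, local ID) to route a response back."""
    return global_id >> ID_BITS, global_id & ((1 << ID_BITS) - 1)

# Masters 0 and 1 both issue transactions with local ID 3:
g0, g1 = extend_id(0, 3), extend_id(1, 3)
assert g0 != g1                            # no collision at the slave
assert split_id(g1) == (1, 3)              # response goes back to master 1, ID 3
```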

64 AXI - Data Interleaving
(Timing diagram: read beats D21, D22, D11, D23, D12, D31, D13, D14 interleaved on the data channel.)
Returned data can even be interleaved across transactions, giving maximum use of the data bus. Note: data within a burst is always in order.

65 Burst read
VALID stays high until READY is sampled high: the valid-ready handshake regulates the data transfer. This is clearly a split-transaction bus!

66 Register slices for max frequency
Channels are asynchronous with respect to each other, so register slices can be applied across any channel. This allows maximum operating frequency by turning combinational delay into latency. (Diagram: a register slice cuts the write-data channel signals WID, WDATA, WSTRB, WLAST, WVALID, WREADY.)

67 Comparison
Setup: three initiators (Init1-Init3), a bus, and three memories (Mem1-Mem3) with 2 wait states.
AHB: it is impossible to hide the slave response latency.
STBus (low/high buffering): while the previous response phase is in progress, a new request can be processed by the next addressed slave; with more buffering, more data are pre-accessed while the previous response phase is in progress.
AXI: interleaving support in the interfaces and in the interconnect allows better interconnect exploitation.
The point: the bus should be able to hide wait states!

68 Scalability
Highly parallel benchmark (no slave bottlenecks); 1 memory wait state; two configurations: 1 kB cache (low bus traffic) and 256 B cache (high bus traffic).

69 Scalability
With increasing contention, AXI and STBus retain 80%+ efficiency while AHB drops below 50%: shared-bus architectures saturate.

70 Network-on-Chip (NoC) topologies
Packetized communication: a packet is made of a HEADER, a PAYLOAD and a TAIL, and is transmitted as a sequence of FLITs (sketched below). The interconnect matrix is inside a single switch! With respect to the buses seen so far, the NoC:
is a protocol evolution: it uses a NETWORK PROTOCOL, i.e., packetized communication;
is a topology evolution: it uses HIGHLY PARALLEL TOPOLOGIES, a combination of crossbars within a multi-hop interconnect fabric.
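A sketch of packetization into flits (flit size and field layout invented): a transaction becomes a header flit carrying routing information, a run of payload flits, and a tail flit that closes the path through the switches.

```python
# Splitting a payload into header / body / tail flits (illustrative layout).

FLIT_PAYLOAD_BYTES = 4

def packetize(dest: int, payload: bytes) -> list[tuple[str, object]]:
    """Build the flit sequence for one packet."""
    flits: list[tuple[str, object]] = [("HEAD", dest)]   # header carries the route
    for i in range(0, len(payload), FLIT_PAYLOAD_BYTES):
        flits.append(("BODY", payload[i:i + FLIT_PAYLOAD_BYTES]))
    flits.append(("TAIL", None))                         # tail releases the path
    return flits

pkt = packetize(dest=7, payload=b"\x01\x02\x03\x04\x05")
assert [kind for kind, _ in pkt] == ["HEAD", "BODY", "BODY", "TAIL"]
```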

71 Case study – Multi-Layer bus vs NoC

72 Design predictability: case study
Hierarchical bus (crossbar + shared buses) topology with the AHB protocol: an AMBA AHB interconnect matrix (crossbar) joins five AHB layers. Pi = processors, Ti = traffic generators, Mi = memories, Si = shared memories; M6 sits on a shared bus.

73 AMBA Multilayer Layout
Floorplan area: 35.3 mm². IP cores are modeled as 1 mm² obstructions (realistic for ARM cores and 32 kB SRAM banks). 130 nm UMC technology library. (Layout: AHB layers and shared slaves around the 1 mm² tiles.)

74 Design predictability: case study
Mesh Network-on-Chip: 15 masters, 15 slaves; each core 1 mm²; 30 NIs (network interfaces) and 15 switches; 38-bit flits; 3-flit FIFOs; fixed-priority arbitration; ACK/NACK flow control; non-pipelined links. (Diagram: processors Px, memories Mx, traffic generators Tx and shared memories Sx tiled on the mesh.)

75 xpipes Quasi-Mesh Layout
Wire routing over the cores was forbidden (worst-case assumption). The placement was not manually iterated to improve area, giving loosely packed tiles. Done twice, for 21-bit and 38-bit flits. Each 1 mm² mesh tile hosts 2 NIs + 1 switch. Floorplan area: 43.8 mm².

76 Design performance and predictability
Post-layout frequency degradation: multilayer -23%(!); network-on-chip -3%. In the NoC, global wiring is segmented and does not add unexpected penalty when doing P&R. The gap should increase with miniaturization. The NoC synthesizes at more than 2x the frequency.

77 Performance Assessment
The shared bus is fully saturated and unusable. The multilayer topology is much more effective. The NoC shows 10-15% faster execution than the multilayer. Why?

78 Bandwidth vs. Latency
Fabric            | Overall bandwidth (GB/s)
AMBA Multilayer   | 26.5
xpipes qmesh/21   | 100
xpipes qmesh/38   | 180
NoC bandwidth is much higher (44 links, ~1 GHz), but this is only an indirect clue of performance: eventual performance is given by latency. The NoC latency penalty/gain depends on the transaction: a penalty on short reads, a gain on posted writes.

79 Power Consumption Results
Power (mW)        | Sequential | Combinational | Overall | Seq. ratio
AMBA Multilayer   |     6      |      66       |   72    |   18.5%
xpipes qmesh/21   |   296      |      81       |  377    |   78.5%
xpipes qmesh/38   |   416      |      85       |  501    |   83.0%
The NoC has higher power consumption, also due to its >2x operating frequency. NoC power is mostly in sequential cells (flip-flops): buffering must be kept to a minimum.

80 Energy Consumption of a System
Energy per benchmark run (mJ):
Fabric            | Run time | Fabric only | 1 W system | 5 W system
AMBA Multilayer   |  1 ms    |   0.072     |   1.07     |   5.60
xpipes qmesh/21   |  0.9 ms  |   0.339     |   1.34     |   5.32
xpipes qmesh/38   |  0.85 ms |   0.426     |   1.37     |   5.13
NoC energy consumption can be comparable to a traditional fabric, or even lower: since execution time goes down, the rest of the system burns less energy.

81 Summing up
Different fabrics feature different tradeoffs: NoCs can perform better than traditional fabrics under heavy load; NoCs are more scalable and predictable; NoCs carry an area penalty; the power/energy assessment is positive provided that TOTAL SYSTEM ENERGY is considered (the importance of interconnect power is relative!).
Today, NoC area and power have been greatly improved. NoCs are NOT deployed everywhere, but in those places where they can unfold their potential; there, they are pervasive, through application-specific topologies.

