EECS 470 Busses in the real world Lecture 22 – Fall 2013
Today’s lecture I want to talk about interconnects. –There are tons of them on a computer Some to memory Some to I/O –There have been lots of implementations I’m going to talk mostly about two older ones: –PCI and P6 But I’m also going to talk a bit about newer ones –QuickPath and (very little) PCI Express Let’s first look at the big picture.
Basic bus issues What are the basic wires for specifying the transaction and moving the data –What are the types of transactions? How are they specified? –How is length of data transfer specified? Who can delay (insert wait states?) How is arbitration done? Out-of-order transfers allowed? –Any restrictions? Error reporting? Weirdness? –Alignment for example.
Transaction types Usually read/write with a length –But in a given domain, other info might be important. Data vs. Code access. I/O vs. memory access Hints to target device –Length might be arbitrary.
Delaying Who can delay and how –Usually a target (slave) can delay –Sometimes initiator (master) can delay –Sometimes initiator can drop the transaction –Sometimes the target has options on how to delay.
Out-of-order Does the bus allow transactions to complete out-of-order? –If so, can increase bandwidth (why?) –If so, might have to worry about ordering issues Memory consistency models not a topic for this class (take EECS 570!) but basics are pretty easy to grasp
PCI PCI stands for “Peripheral Component Interconnect” –Many cards you plug into a computer are PCI (most network cards, older graphics cards, etc.) –Normal configurations have PCI as a 33MHz bus with 32-bit shared address/data lines. This is based on version 2.1 of the PCI spec. –Changes with 2.3 and 3.0 are fairly minor from our viewpoint. [Figure: block diagram showing Proc, L2, BSB, P6 bus, Chipset, Mem, and PCI]
Speeds Conventional PCI is at version 2.3 –Basic version is 32-bits at 33MHz and 5 volts –Version 2.1 allowed 5V or 3.3V and up to 64bit 66MHz PCI-X –Backwards compatible (but not 5 volts apparently) –Up to 533MHz with only 1 load
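As a rough sanity check on these speeds, here is a minimal sketch of the peak-rate arithmetic. It assumes one data transfer on every clock, with no address phases or wait states, so the numbers are upper bounds rather than what a real bus sustains. The function name is made up for illustration.

```python
def peak_bandwidth_mb_per_s(width_bits: int, clock_mhz: float) -> float:
    """Peak rate in MB/s (10**6 bytes/s), assuming one transfer per clock."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz


# Basic PCI: 32 bits at 33 MHz
print(peak_bandwidth_mb_per_s(32, 33))   # 132.0
# Widest/fastest conventional PCI: 64 bits at 66 MHz
print(peak_bandwidth_mb_per_s(64, 66))   # 528.0
```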
PCI Master Device (required signals only) Address/data and command: AD[31:0], C/BE#[3:0], PAR Interface control: FRAME#, TRDY#, IRDY#, STOP#, DEVSEL# Arbitration: REQ#, GNT# System: CLK, RST# Error reporting: PERR#, SERR#
Basics AD[31:0] bus is for the address and the data The C/BE#[3:0] is the Command in the address phase and the Byte Enable in the data phase FRAME#, TRDY#, IRDY# are main control signals. Other signals: –PAR is even parity over AD and C/BE# buses. –PERR# and SERR# are Parity and System error reporting –CLK is clock –RST# is a request to reset all devices.
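To make "even parity over AD and C/BE#" concrete, here is a small sketch. It computes the PAR bit so that the 32 AD bits, the 4 C/BE# bits, and PAR together contain an even number of 1s. The function name is hypothetical, and the sketch ignores when PAR is actually driven on the bus.

```python
def pci_par(ad: int, cbe: int) -> int:
    """Return PAR so AD[31:0], C/BE#[3:0] and PAR hold an even number of 1s."""
    # Pack the 36 covered bits into one integer, then count the 1 bits.
    covered = ((cbe & 0xF) << 32) | (ad & 0xFFFFFFFF)
    return bin(covered).count("1") % 2


print(pci_par(0x00000001, 0x0))  # 1: one 1 bit covered, PAR makes it two
print(pci_par(0x00000003, 0x0))  # 0: already an even count
```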
AD and C/BE AD –First phase is address –Everything afterwards is data C/BE# –First phase is command –Rest is byte enable.
Control FRAME# is asserted during the first phase of the transaction and until the last data phase. TRDY# indicates that the target has valid data on the bus (READ) or is able to read valid data (WRITE) IRDY# is the same as TRDY# but for the initiator.
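One way to read the IRDY#/TRDY# handshake: data moves only on a clock where both sides are ready. The sketch below models each signal as a per-clock list of booleans, where True means the active-low signal is asserted. The helper name and the sample waveforms are made up for illustration.

```python
def data_transfer_clocks(irdy, trdy):
    """Return the clock indices on which a data word actually transfers.

    irdy and trdy are per-clock booleans; True means the signal is
    asserted (driven low on the real bus) during that clock.
    """
    return [clk for clk, (i, t) in enumerate(zip(irdy, trdy)) if i and t]


# Clock 0 is the address phase; the target inserts wait states on clocks 1 and 3.
irdy = [False, True, True, True, True]
trdy = [False, False, True, False, True]
print(data_transfer_clocks(irdy, trdy))  # [2, 4]
```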
Deep thoughts with Mark Notice that the length of the transaction is not specified explicitly –Starts at the given address. Keeps giving next data until done. But this makes things hard for the target. How much data should be fetched? –So the various read commands give hints about how much data to move Read is for less than a cache line Read line is for a cache line or so Read multiple is for more than 1 or 2 cache lines
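A sketch of how an initiator might pick among the three read commands. The exact cut-offs are assumptions (the slide only says "a cache line or so" and "more than 1 or 2 cache lines"), and the 32-byte default line size is likewise just an illustrative choice.

```python
def choose_read_command(nbytes: int, line_size: int = 32) -> str:
    """Pick a PCI read command as a hint for how much the target should fetch.

    Thresholds are illustrative assumptions, not taken from the PCI spec.
    """
    if nbytes < line_size:
        return "Memory Read"           # less than a cache line
    if nbytes <= 2 * line_size:
        return "Memory Read Line"      # a cache line or so
    return "Memory Read Multiple"      # several cache lines


print(choose_read_command(4))      # Memory Read
print(choose_read_command(32))     # Memory Read Line
print(choose_read_command(4096))   # Memory Read Multiple
```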
More deep thoughts It turns out many NIC cards did things in a really wacky way. –They would read (and/or write) 4KB pages by reading 4 bytes, going away, reading 4 more bytes –This caused significant problems on high-end (web) server performance. –But was okay on most workstations/desktops. Moral: There is a cost vs. performance trade-off on almost everything you do. Be sure to consider the ramifications of solving the problem for only one domain.
Try to draw the write… [Blank timing diagram with rows: CLK, AD[31:0], FRAME#, C/BE#, IRDY#, TRDY#]
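One solution [Timing diagram with rows: CLK, AD[31:0], FRAME#, C/BE#, IRDY#, TRDY#. AD carries the address and then data D1, D2, D3, D4; C/BE# carries the command and then the byte enables]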
PCI write types Memory Write –Just says gonna write Memory Write and Invalidate –Writing –Promises (100%!) that the write will start and end on cache line boundaries. –Why is this useful?
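The promise made by Memory Write and Invalidate can be checked with simple alignment arithmetic. This is a minimal sketch; the function name is hypothetical and the 32-byte line size is an assumed default.

```python
def can_use_write_and_invalidate(addr: int, nbytes: int, line_size: int = 32) -> bool:
    """True if the write covers only whole cache lines.

    The write must start on a line boundary and its length must be a
    nonzero multiple of the line size, so it also ends on a boundary.
    """
    return nbytes > 0 and addr % line_size == 0 and nbytes % line_size == 0


print(can_use_write_and_invalidate(0x1000, 64))  # True: two whole lines
print(can_use_write_and_invalidate(0x1004, 64))  # False: starts mid-line
print(can_use_write_and_invalidate(0x1000, 48))  # False: ends mid-line
```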
PCI Arbitration Hidden phase –That is, done in parallel with transfers –Centralized arbiter –Arbitration algorithm unspecified, but must be “fair” Fair isn’t all that fair… –REQ#, GNT# –Get bus when GNT# asserted, and FRAME#, TRDY#, IRDY# not asserted. –Must give up in some reasonable time when GNT# is de-asserted. Notice, arbiter has separate request and grant lines for each PCI master…
Ordering PCI target can say “go away” –STOP# signal Initiator is obligated to come back to finish request. –(Notice with FRAME# the target can tell if transaction was done anyway) Any ordering restrictions are not PCI’s problem.
Basic bus issues: PCI? What are the basic wires for specifying the transaction and moving the data –What are the types of transactions? How are they specified? –How is length of data transfer specified? Who can delay (insert wait states?) How is arbitration done? Out-of-order transfers allowed? –Any restrictions? Error reporting? Weirdness? –Alignment for example.
Basics of the P6 bus The goal of the P6 bus is to allow communication among the processors and the chipset –Transactions are directed toward the chipset. –All of the processors “snoop” the bus. –It uses about 170 pins total [Figure: block diagram showing Proc, L2, BSB, P6 bus, Chipset, Mem, and PCI]
Basics of the P6 bus (cont.) There are generally 6 phases of a transaction. –Arbitration - ask to use the bus –Request - Send Transaction details (R/W, size) –Error - parity error on request mainly –Snoop - let other processors get involved –Response - The “Ack” –Data transfer - Actual movement of data In general the same phase of two transactions are separated by 3 clocks.
Why bother? The goal of this part of the presentation is to expose you to a more complex bus. –The bus is a true “split-transaction” bus That is, it is pipelined. Increased bandwidth due to overlapping of accesses No real impact on latency (why?) –It is the most complex bus I’m aware of. Newer versions of the bus (P3, P4, Itanium) have some changes, but basics are the same.
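A back-of-the-envelope sketch of why pipelining raises bandwidth without changing latency. The 3-clock spacing between transactions comes from the slides; the 12-clock transaction latency is a made-up figure for illustration only.

```python
def clocks_unpipelined(n: int, latency: int) -> int:
    """Total clocks if each transaction must finish before the next starts."""
    return n * latency


def clocks_pipelined(n: int, latency: int, interval: int = 3) -> int:
    """Total clocks if a new transaction can start every `interval` clocks.

    The first transaction still takes the full latency; each later one
    finishes `interval` clocks after the one before it.
    """
    return latency + (n - 1) * interval if n > 0 else 0


# Four transactions, each with an assumed 12-clock latency.
print(clocks_unpipelined(4, 12))  # 48
print(clocks_pipelined(4, 12))    # 21
```

Each individual transaction still takes the same 12 clocks in both cases; only the total for a stream of transactions shrinks.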
Bus protocol Each device on the bus has to be fairly sophisticated. –Arbitration is handled without a centralized arbiter. –Each device must keep track of the order of the transactions and which transaction is in what stage. This ordering is called the “In order queue” or IOQ. –In addition there are “Out of Order” transactions. These are used for transactions which are likely to take a while. (So they don’t interfere with the others)
Timing between phases CLK ARB REQ ERROR SNOOP RESP DATA One or more clock ticks
Phase 1: Arbitration The arbitration phase mainly involves 5 pins –BREQ#[0:3] - Symmetric agent request –BPRI# - Priority agent request Each processor keeps track of a rotating ID –The rotating ID is the last device to perform a bus transaction –Each device is only allowed to perform one transaction at a time if other devices also want to use the bus –If more than one device wants to use the bus the winner is the device which is “next” So if the current ID is 2 the priorities are 3, 0, 1, 2 If it is 0 the order is 1,2,3,0.
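The rotating-ID rule can be sketched in a few lines. This assumes four symmetric agents numbered 0 to 3 and ignores the priority agent (BPRI#); the function names are made up for illustration.

```python
def priority_order(rotating_id, n_agents=4):
    """Agents from highest to lowest priority, given the current rotating ID.

    The agent just after the rotating ID is highest; the agent that
    owns the rotating ID is lowest.
    """
    return [(rotating_id + k) % n_agents for k in range(1, n_agents + 1)]


def next_owner(rotating_id, requesting, n_agents=4):
    """Pick the winner among the agents currently asserting BREQ#, or None."""
    for agent in priority_order(rotating_id, n_agents):
        if agent in requesting:
            return agent
    return None


print(priority_order(2))       # [3, 0, 1, 2]
print(priority_order(0))       # [1, 2, 3, 0]
print(next_owner(2, {0, 1}))   # 0
```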
The arbitration rules The device must continue to hold BREQ# asserted until the clock before it asserts ADS# (starts the transaction). Once a device starts its transaction it must deassert its BREQ# line if any other BREQ# line is asserted. On the clock its BREQ# is deasserted all devices recompute which device will be allowed to go next. Each agent updates its rotating ID after it deasserts BREQ#. If the bus is idle then it can assert ADS# two clocks after winning arbitration. Each ADS# assertion must be at least 3 cycles apart.
Symmetric Arbitration example (with bus parking…) [Timing diagram with rows: CLK, BREQ0, BREQ1, BREQ2, BREQ3, R. ID, Active?, ADS#. Transactions labeled 0a, 1a, 2a, 0b, 0c]
0 has a request at time 2 and 9. 1 has a request at time 1. 2 has a request at time 7. [Blank timing diagram with rows: CLK, BREQ0, BREQ1, BREQ2, BREQ3, R. ID, Active?, ADS#. Starting values shown: R. ID 3, Active? N]
Request There are about 40 pins involved in this phase The phase lasts 2 clocks. The total of 80 signals (2 x 40) includes: –36 bit address –Type of transaction –Byte enables –Size of transaction –Code/Data info The ADS# signal “qualifies” the request signals –It is low during the first clock of the request phase.
Error Fairly trivial. –Parity is checked for. –If the Parity check fails then AERR# is asserted. All transactions in the IOQ are canceled and everything starts over. I believe current implementations may crash with a parity error at this point.
Snoop There are only 3 signals in the snoop phase –HIT#, HITM#, and DEFER# Nonetheless the snoop phase is the most complex part of the whole P6 protocol. HIT# –If a processor has the data in its cache in the Shared or Exclusive state it asserts HIT# HITM# –Is asserted by a processor if it has a “dirty” or Modified version of the data in its cache
Snoop (cont.) DEFER# –Is only asserted by the chipset (or perhaps by some other priority agent). –It says that the chipset wants to pull this transaction out of the IOQ because it could take a while to respond. –DEFER# can also result in a “retry” request If HIT# and HITM# are asserted –It is a snoop stall (i.e. an agent on the bus could not respond to the request in time) –Snoop results are re-checked in 2 clocks
Snoop (cont.) If HITM# and DEFER# are asserted –The DEFER# is ignored. If HITM# is asserted –The processor asserting HITM# is responsible for supplying the data –The chipset is expected to “snarf” the data (i.e. copy it into the DRAM) as it passes by. Once the snoop phase has happened and DEFER# has not been asserted the transaction must complete.
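The snoop rules above amount to a small priority decode over the three signals. The sketch below takes each signal as a boolean (True means asserted) and returns a descriptive label; the labels and the function name are made up for illustration and are not bus terminology.

```python
def snoop_outcome(hit: bool, hitm: bool, defer: bool) -> str:
    """Decode the snoop phase result; True means the signal is asserted."""
    if hit and hitm:
        # Some agent could not answer in time; sample again in 2 clocks.
        return "snoop stall"
    if hitm:
        # A cache holds the line Modified and supplies the data;
        # DEFER# is ignored in this case.
        return "modified hit"
    if defer:
        # The chipset pulls the transaction out of the IOQ (or asks for a retry).
        return "deferred or retry"
    if hit:
        # A cache holds the line in Shared or Exclusive state.
        return "clean hit"
    return "miss"


print(snoop_outcome(hit=True, hitm=True, defer=False))   # snoop stall
print(snoop_outcome(hit=False, hitm=True, defer=True))   # modified hit
print(snoop_outcome(hit=False, hitm=False, defer=True))  # deferred or retry
```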
Response Phase This phase is mostly concerned with 3 signals, called RS[0:2]. The 8 different combinations of these signals encode the following possibilities: Hard Failure -- Something went VERY wrong Implicit Writeback -- HITM# was asserted Deferred -- Transaction deferred Retry -- Only if DEFER# was asserted Normal Data -- Standard response No Data -- Transaction requires no data
Data Phase This phase consists of –64 bits of data D[0:63]# –A DRDY and TRDY (pretty much the same as IRDY and TRDY on PCI) All transactions are one of: –0 bytes -- Invalidate –8 bytes or less -- Write thru mode and uncacheable addresses will do this –32 bytes -- moving a whole cache line Which one it will be was determined during the request and snoop phases –What does the snoop phase have to do with it?
P6 Review CLK ARB REQ ERROR SNOOP RESP DATA One or more clock ticks
I/O Processor Memory & I/O Chipset L2 Cache L2 Cache Front Side Bus (Processor bus) PCI bus (I/O bus) Memory
QuickPath Interconnect Here things are all point-to-point. –No shared bus –Can be as simple as a single processor talking to the chipset –Can be as complex as picture shown. Memory and I/O interfaces are different Largely taken from: https://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect and
Details QPI has two 20-bit signals, one in each direction –Each direction also has a clock So 42 signals. Each signal is a differential pair –Thus 84 pins. 80-bit “flit” is the packet size. –Transferred in two clock cycles (four 20 bit transfers, two per clock.) –The 80-bit "flit" has 8 bits for error detection, 8 bits for "link-layer header," and 64 bits for "data". Thus 8 bytes of useable information per 2 clocks in both directions.
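The pin and flit numbers on this slide follow from a few lines of arithmetic, written out here as a sketch. All constants come from the slide; the variable names are made up.

```python
LANES_PER_DIRECTION = 20   # data signals each way
CLOCKS_PER_DIRECTION = 1   # one forwarded clock each way
DIRECTIONS = 2

signals = DIRECTIONS * (LANES_PER_DIRECTION + CLOCKS_PER_DIRECTION)
pins = signals * 2         # every signal is a differential pair

FLIT_BITS = 80
transfers_per_flit = FLIT_BITS // LANES_PER_DIRECTION  # 20 bits per transfer
clocks_per_flit = transfers_per_flit // 2              # two transfers per clock

payload_bits = FLIT_BITS - 8 - 8   # minus error detection and link-layer header
payload_bytes = payload_bits // 8

print(signals, pins)                        # 42 84
print(transfers_per_flit, clocks_per_flit)  # 4 2
print(payload_bytes)                        # 8
```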
Higher-level protocols These 64-bit packets can be used for anything –Addresses –Data –Routing information –etc. Wires no longer are the thing (address, data, etc.) –More complex for hardware, but so what?
Point-to-point? What about snooping? We’re back to a network. –So we need a directory-based solution. Uses a variation of MESI, MESIF –F state is like shared, but is allowed to supply clean data Why? Has two schemes for doing snooping. –Home snoop –Source snoop
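A highly simplified sketch of what the F state buys: when several caches share a line, at most one of them is designated to answer a read. The function name and the dictionary layout are made up. Treating an Exclusive copy like a Forward copy is an assumption of this sketch, and it models neither home snoop nor source snoop message flows.

```python
def data_supplier(cache_states):
    """Return which agent supplies a line on a read, given per-cache MESIF states.

    cache_states maps a cache name to one of "M", "E", "S", "I", "F".
    """
    # A Modified copy is the only up-to-date one, so its owner must supply it.
    for name, state in cache_states.items():
        if state == "M":
            return name
    # A Forward copy (or, by assumption here, an Exclusive one) supplies clean data.
    for name, state in cache_states.items():
        if state in ("F", "E"):
            return name
    # Plain Shared copies stay silent, so memory answers.
    return "memory"


print(data_supplier({"c0": "S", "c1": "F", "c2": "S"}))  # c1
print(data_supplier({"c0": "S", "c1": "S"}))             # memory
print(data_supplier({"c0": "I", "c1": "M"}))             # c1
```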
Home snoop (1/2)
Home snoop (2/2)
Source Snoop (1/2)
PCI express Basic idea is to do the job of PCI using a scheme similar to QPI. –Though it preceded QPI and QPI borrowed a number of ideas from it. Faster I/O –Again point-to-point for speed –Complexity for speed Target devices need to be smarter.