EECS 470 Busses in the real world Lecture 22 – Fall 2013.


Today’s lecture I want to talk about interconnects. –There are tons of them on a computer Some to memory Some to I/O –There have been lots of implementations I’m going to talk mostly about two older ones: –PCI and P6 But I’m also going to talk a bit about newer ones –QuickPath and (very little) PCI Express Let’s first look at the big picture.

Various buses

Basic bus issues What are the basic wires for specifying the transaction and moving the data –What are the types of transactions? How are they specified? –How is length of data transfer specified? Who can delay (insert wait states?) How is arbitration done? Out-of-order transfers allowed? –Any restrictions? Error reporting? Weirdness? –Alignment for example.

Transaction types Usually read/write with a length –But in a given domain, other info might be important. Data vs. Code access. I/O vs. memory access Hints to target device –Length might be arbitrary.

Delaying Who can delay and how –Usually a target (slave) can delay –Sometimes initiator (master) can delay –Sometimes initiator can drop the transaction –Sometimes the target has options on how to delay.

Arbitration Fairness –Even sharing, priority sharing, weighted sharing Mechanism –Centralized arbiter –Distributed arbiter –Combination Duration –Until done –Until someone else requests –Until certain time passes. –Combination

Out-of-order Does the bus allow transactions to complete out-of-order? –If so, can increase bandwidth (why?) –If so, might have to worry about ordering issues Memory consistency models not a topic for this class (take EECS 570!) but basics are pretty easy to grasp

PCI PCI stands for “Peripheral Component Interconnect” –Many cards you plug into a computer are PCI (most network cards, older graphics cards, etc.) –Normal configurations have PCI as a 33MHz bus with 32 shared address/data lines. This is based on version 2.1 of the PCI spec. –Changes with 3.0 and 2.3 are fairly minor from our viewpoint. [Figure: block diagram with Proc, L2, BSB, P6 bus, Chipset, Mem, and PCI]

Speeds Conventional PCI is at version 2.3 –Basic version is 32-bits at 33MHz and 5 volts –Version 2.1 allowed 5V or 3.3V and up to 64bit 66MHz PCI-X –Backwards compatible (but not 5 volts apparently) –Up to 533MHz with only 1 load

PCI Master Device (required signals only) –Address/data and command: AD[31:0], C/BE#[3:0], PAR –Interface control: FRAME#, TRDY#, IRDY#, STOP#, DEVSEL# –Arbitration: REQ#, GNT# –System: CLK, RST# –Error reporting: PERR#, SERR#

Basics AD[31:0] bus is for the address and the data The C/BE#[3:0] is the Command in the address phase and the Byte Enable in the data phase FRAME#, TRDY#, IRDY# are main control signals. Other signals: –PAR is even parity over AD and C/BE# buses. –PERR# and SERR# are Parity and System error reporting –CLK is clock –RST# is a request to reset all devices.
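
The PAR rule above can be sketched in a few lines. This is a minimal illustration, assuming "even parity" means the count of 1 bits across AD, C/BE#, and PAR together is even; the function name `pci_par` and its integer arguments are made up for this sketch and are not from the PCI spec.

```python
def pci_par(ad: int, cbe: int) -> int:
    """Return the PAR bit for a 32-bit AD value and a 4-bit C/BE# value.

    Assumption: even parity, so the number of 1 bits across AD, C/BE#,
    and PAR taken together must come out even.
    """
    if not 0 <= ad < (1 << 32):
        raise ValueError("AD must fit in 32 bits")
    if not 0 <= cbe < (1 << 4):
        raise ValueError("C/BE# must fit in 4 bits")
    ones = bin(ad).count("1") + bin(cbe).count("1")
    # PAR is 1 only when the 36 covered bits hold an odd number of 1s.
    return ones & 1


print(pci_par(0x00000001, 0b0000))  # one 1 bit on the bus, so PAR = 1
print(pci_par(0x00000003, 0b0000))  # two 1 bits on the bus, so PAR = 0
```

A receiver can run the same function over what it sampled and compare against the PAR it saw; a mismatch is what PERR# would report.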

AD and C/BE AD –First phase is address –Everything afterwards is data C/BE# –First phase is command –Rest is byte enable.

Control FRAME# is asserted during the first phase of the transaction and until the last data phase. TRDY# indicates that the target has valid data on the bus (READ) or is able to read valid data (WRITE) IRDY# is the same as TRDY# but for the initiator.
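
The IRDY#/TRDY# handshake can be modeled as: a data word moves only on a clock where both sides are ready, and any other clock is a wait state. The sketch below assumes that model; the function name and the per-clock boolean lists are hypothetical, with True meaning "asserted" (electrically low).

```python
def data_transfer_clocks(irdy, trdy):
    """Return the clock indices on which a data word moves.

    irdy and trdy are per-clock booleans; True means the active-low
    signal is asserted on that clock. A word transfers only when both
    the initiator (IRDY#) and the target (TRDY#) are ready.
    """
    if len(irdy) != len(trdy):
        raise ValueError("irdy and trdy must cover the same clocks")
    return [clk for clk, (i, t) in enumerate(zip(irdy, trdy)) if i and t]


# Hypothetical read: clock 0 is the address phase, the target inserts
# wait states on clocks 1 and 3, and three data words move.
irdy = [False, True, True, True, True, True]
trdy = [False, False, True, False, True, True]
print(data_transfer_clocks(irdy, trdy))  # [2, 4, 5]
```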

Simple PCI read transaction [Timing diagram: CLK, AD[0:31], FRAME#, C/BE#, IRDY#, TRDY#. AD carries the address (labeled ADS), then D1, D2, D3; C/BE# carries CMD, then BE.]

Simple PCI read transaction cont. [Same timing diagram, continued through D2.]

Deep thoughts with Mark Notice that the length of the transaction is not specified explicitly –Starts at the given address. Keeps giving next data until done. But this makes things hard for the target. How much data should be fetched? –So the various read commands give hints about how much data to move Read is for less than a cache line Read line is for a cache line or so Read multiple is for more than 1 or 2 cache lines

More deep thoughts It turns out many NICs did things in a really wacky way. –They would read (and/or write) 4KB pages by reading 4 bytes, going away, reading 4 more bytes –This caused significant performance problems on high-end (web) servers. –But was okay on most workstations/desktops. Moral: There is a cost vs. performance trade-off on almost everything you do. Be sure to consider the ramifications of solving the problem for only one domain.

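Try to draw the write… [Blank timing diagram: CLK, AD[0:31], FRAME#, C/BE#, IRDY#, TRDY#]

One solution [Timing diagram: CLK, AD[0:31], FRAME#, C/BE#, IRDY#, TRDY#. AD carries the address (labeled ADS), then D1, D2, D3, D4; C/BE# carries CMD, then BE.]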

PCI write types Memory Write –Just says gonna write Memory Write and Invalidate –Writing –Promises (100%!) that the write will start and end on cache line boundaries. –Why is this useful?
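
The Memory Write and Invalidate promise amounts to an alignment check on the start and end of the burst. Here is a minimal sketch; the function name is hypothetical, and the 32-byte default line size is an assumption borrowed from the P6 cache line mentioned later in the lecture (a real device would learn the line size from its configuration).

```python
def covers_whole_cache_lines(start: int, length: int, line_size: int = 32) -> bool:
    """True if a write of `length` bytes at byte address `start`
    both starts and ends on a cache line boundary.

    line_size=32 is an assumed default, not something PCI fixes.
    """
    if length <= 0 or line_size <= 0:
        raise ValueError("length and line_size must be positive")
    return start % line_size == 0 and (start + length) % line_size == 0


print(covers_whole_cache_lines(0x1000, 64))  # True: exactly two full lines
print(covers_whole_cache_lines(0x1010, 32))  # False: starts mid-line
```

One reason the guarantee matters: when a whole line is being overwritten, a cached copy of that line can simply be invalidated instead of being merged with the new data.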

PCI Arbitration Hidden phase –That is, done in parallel with transfers –Centralized arbiter –Arbitration algorithm unspecified, but must be “fair” Fair isn’t all that fair… –REQ#, GNT# –Get bus when GNT# asserted, and FRAME#, TRDY#, IRDY# not asserted. –Must give up in some reasonable time once GNT# is de-asserted. Notice, arbiter has separate grant and request lines for each PCI master…

Ordering PCI target can say “go away” –STOP# signal Initiator is obligated to come back to finish request. –(Notice with FRAME# the target can tell if transaction was done anyways) Any ordering restrictions are not PCI’s problem.

Basic bus issues: PCI? What are the basic wires for specifying the transaction and moving the data –What are the types of transactions? How are they specified? –How is length of data transfer specified? Who can delay (insert wait states?) How is arbitration done? Out-of-order transfers allowed? –Any restrictions? Error reporting? Weirdness? –Alignment for example.

Basics of the P6 bus The goal of the P6 bus is to allow communication among the processors and the chipset –Transactions are directed toward the chipset. –All of the processors “snoop” the bus. –It uses about 170 pins total [Figure: block diagram with Proc, L2, BSB, P6 bus, Chipset, Mem, and PCI]

Basics of the P6 bus (cont.) There are generally 6 phases of a transaction. –Arbitration - ask to use the bus –Request - Send Transaction details (R/W, size) –Error - parity error on request mainly –Snoop - let other processors get involved –Response - The “Ack” –Data transfer - Actual movement of data In general, the same phase of two consecutive transactions is separated by 3 clocks.

Why bother? The goal of this part of the presentation is to expose you to a more complex bus. –The bus is a true “split-transaction” bus That is, it is pipelined. Increased bandwidth due to overlapping of accesses No real impact on latency (why?) –It is the most complex bus I’m aware of. Newer versions of the bus (P3, P4, Itanium) have some changes, but basics are the same.
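
A back-of-the-envelope model shows why pipelining helps bandwidth but not latency. It is only a sketch: the 12-clock per-transaction latency is a made-up number, and the 3-clock issue interval is the phase spacing quoted for the P6 bus.

```python
def total_clocks(n_transactions: int, latency: int, issue_interval=None) -> int:
    """Clocks until the last of n back-to-back transactions finishes.

    issue_interval=None models a non-pipelined bus: the next transaction
    starts only when the previous one has finished. Otherwise a new
    transaction starts every issue_interval clocks.
    """
    if n_transactions < 1 or latency < 1:
        raise ValueError("need at least one transaction and one clock")
    if issue_interval is None:
        issue_interval = latency
    return latency + (n_transactions - 1) * issue_interval


LATENCY = 12  # hypothetical clocks from arbitration to end of data
print(total_clocks(10, LATENCY))     # 120 clocks without overlap
print(total_clocks(10, LATENCY, 3))  # 39 clocks when issued every 3 clocks
```

Each transaction still takes LATENCY clocks from start to finish in both cases; only the time to get ten of them through changes.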

Bus protocol Each device on the bus has to be fairly sophisticated. –Arbitration is handled without a centralized arbiter. –Each device must keep track of the order of the transactions and which transaction is in what stage. This ordering is called the “In order queue” or IOQ. –In addition there are “Out of Order” transactions. These are used for transactions which are likely to take a while. (So they don’t interfere with the others)

Timing between phases [Timing diagram: CLK against the ARB, REQ, ERROR, SNOOP, RESP, and DATA phases; annotation: one or more clock ticks]

Phase 1: Arbitration The arbitration phase mainly involves 5 pins –BREQ#[0:3] - Symmetric agent request –BPRI# - Priority agent request Each processor keeps track of a rotating ID –The rotating ID is the last device to perform a bus transaction –Each device is only allowed to perform one transaction at a time if other devices also want to use the bus –If more than one device wants to use the bus the winner is the device which is “next” So if the current ID is 2 the priorities are 3, 0, 1, 2 If it is 0 the order is 1,2,3,0.
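
The rotating-ID rule can be sketched as follows. The function names and the set-of-requesters representation are assumptions for illustration; the 4-agent default matches the four BREQ# lines.

```python
from typing import List, Optional, Set


def priority_order(rotating_id: int, n_agents: int = 4) -> List[int]:
    """Agents from highest to lowest priority, given the rotating ID
    (the last agent to own the bus). That agent ends up last."""
    return [(rotating_id + k) % n_agents for k in range(1, n_agents + 1)]


def next_winner(rotating_id: int, requesting: Set[int], n_agents: int = 4) -> Optional[int]:
    """Pick the requesting agent that is 'next' after the rotating ID,
    or None if nobody is asserting its BREQ# line."""
    for agent in priority_order(rotating_id, n_agents):
        if agent in requesting:
            return agent
    return None


print(priority_order(2))       # [3, 0, 1, 2]
print(priority_order(0))       # [1, 2, 3, 0]
print(next_winner(2, {0, 1}))  # 0
```

Because every agent runs the same computation on the same BREQ# lines, they all agree on the winner without a central arbiter.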

The arbitration rules The device must continue to hold BREQ asserted until the clock before it asserts ADS# (starts the transaction). Once a device starts its transaction it must deassert its BREQ line if any other BREQ line is asserted. On the clock its BREQ is deasserted, all devices recompute which device will be allowed to go next. Each agent updates its rotating ID after it deasserts BREQ#. If the bus is idle then it can assert ADS# two clocks after winning arbitration. Successive ADS# assertions must be at least 3 cycles apart.

Symmetric Arbitration example (with bus parking…) [Timing diagram: CLK, BREQ0, BREQ1, BREQ2, BREQ3, R. ID, Active?, ADS#. Active? is N for the first two clocks, then Y; the ADS# row shows transactions 0a, 1a, 2a, 0b, 0c.]

Agent 0 has a request at times 2 and 9; agent 1 has a request at time 1; agent 2 has a request at time 7. [Timing diagram to fill in: CLK, BREQ0, BREQ1, BREQ2, BREQ3, R. ID, Active?, ADS#; starting with R. ID = 3 and Active? = N]

Request There are about 40 pins involved in this phase. The phase lasts 2 clocks. The total of 80 signals (2 x 40) includes: –36 bit address –Type of transaction –Byte enables –Size of transaction –Code/Data info The ADS# signal “qualifies” the request signals –It is low during the first clock of the request phase.

Error Fairly trivial. –Parity is checked for. –If the Parity check fails then AERR# is asserted. All transactions in the IOQ are canceled and everything starts over. I believe current implementations may crash with a parity error at this point.

Snoop There are only 3 signals in the snoop phase –HIT#, HITM#, and DEFER# Nonetheless, the snoop phase is the most complex part of the whole P6 protocol. HIT# –If a processor has the data in its cache in the Shared or Exclusive state it asserts HIT# HITM# –Is asserted by a processor if it has a “dirty” or Modified version of the data in its cache

Snoop (cont.) DEFER# –Is only asserted by the chipset (or perhaps by some other priority agent). –It says that the chipset wants to pull this transaction out of the IOQ because it could take a while to respond. –DEFER# can also result in a “retry” request If HIT# and HITM# are both asserted –It is a snoop stall (i.e., an agent on the bus could not respond to the request in time) –Snoop results are re-checked in 2 clocks

Snoop (cont.) If HITM# and DEFER# are asserted –The DEFER# is ignored. If HITM# is asserted –The processor asserting HITM# is responsible for supplying the data –The chipset is expected to “snarf” the data (i.e., copy it into the DRAM) as it passes by. Once the snoop phase has happened and DEFER# has not been asserted the transaction must complete.
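
The snoop rules from the last three slides can be collected into one decoding function. This is a sketch only: the function name and result labels are invented, True means the signal is asserted, and the case of a clean HIT# together with DEFER# is not covered by the slides, so the sketch assumes DEFER# wins there.

```python
def snoop_result(hit: bool, hitm: bool, defer: bool) -> str:
    """Classify one snoop phase from HIT#, HITM#, and DEFER#.

    True means the (active-low) signal is asserted.
    """
    if hit and hitm:
        # Both asserted: snoop stall, results are re-checked in 2 clocks.
        return "stall"
    if hitm:
        # A processor holds a Modified copy and must supply the data;
        # DEFER# is ignored in this case.
        return "modified"
    if defer:
        # Assumption: DEFER# takes precedence over a clean HIT#.
        return "deferred"
    if hit:
        # Some cache holds the line in Shared or Exclusive state.
        return "clean_hit"
    return "miss"


print(snoop_result(hit=False, hitm=True, defer=True))  # modified
print(snoop_result(hit=True, hitm=True, defer=False))  # stall
```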

Response Phase This phase is mostly concerned with 3 signals, called RS[0:2]. The 8 possible combinations of these signals encode responses including: –Hard Failure -- Something went VERY wrong –Implicit Writeback -- HITM# was asserted –Deferred -- Transaction deferred –Retry -- Only if DEFER# was asserted –Normal Data -- Standard response –No Data -- Transaction requires no data

Data Phase This phase consists of –64 bits of data D[0:63]# –A DRDY and TRDY (pretty much the same as IRDY and TRDY on PCI) All transactions are one of: –0 bytes -- Invalidate –8 bytes or less -- Write thru mode and uncacheable addresses will do this –32 bytes -- moving a whole cache line Which one it will be was determined during the request and snoop phases –What does the snoop phase have to do with it?

P6 Review [Timing diagram: CLK against the ARB, REQ, ERROR, SNOOP, RESP, and DATA phases; annotation: one or more clock ticks]

[Figure: system block diagram. Labels: Processor, L2 Cache (twice), Front Side Bus (Processor bus), Memory & I/O Chipset, Memory, PCI bus (I/O bus), I/O]

QuickPath Interconnect Here things are all point-to-point. –No shared bus –Can be as simple as a single processor talking to the chipset –Can be as complex as picture shown. Memory and I/O interfaces are different Largely taken from: and

Details QPI has two 20-bit signals, one in each direction –Each direction also has a clock So 42 signals. Each signal is a differential pair –Thus 84 pins. 80-bit “flit” is the packet size. –Transferred in two clock cycles (four 20-bit transfers, two per clock.) –The 80-bit "flit" has 8 bits for error detection, 8 bits for "link-layer header," and 64 bits for "data". Thus 8 bytes of usable information per 2 clocks in each direction.
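
The arithmetic on this slide can be checked in a few lines. The constant names are made up for the sketch, and the 3.2 GHz clock in the usage line is a hypothetical rate chosen only to get a concrete number.

```python
DATA_LANES = 20          # bits per direction
CLOCK_LANES = 1          # one forwarded clock per direction
DIRECTIONS = 2
FLIT_BITS = 80
CRC_BITS = 8             # error detection
HEADER_BITS = 8          # link-layer header
TRANSFERS_PER_CLOCK = 2  # two 20-bit transfers per clock

signals = (DATA_LANES + CLOCK_LANES) * DIRECTIONS                 # 42
pins = signals * 2                                                # 84, differential pairs
clocks_per_flit = FLIT_BITS // DATA_LANES // TRANSFERS_PER_CLOCK  # 2
payload_bytes = (FLIT_BITS - CRC_BITS - HEADER_BITS) // 8         # 8


def payload_bytes_per_second(clock_hz: float) -> float:
    """Usable bytes per second in ONE direction at the given link clock."""
    return payload_bytes * clock_hz / clocks_per_flit


print(signals, pins, clocks_per_flit, payload_bytes)  # 42 84 2 8
print(payload_bytes_per_second(3.2e9) / 1e9)          # GB/s per direction at a hypothetical 3.2 GHz
```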

Higher-level protocols These 64-bit packets can be used for anything –Addresses –Data –Routing information –etc. Wires no longer are the thing (address, data, etc.) –More complex for hardware, but so what?

Point-to-point? What about snooping? We’re back to a network. –So we need a directory-based solution. Uses a variation of MESI, MESIF –F state is like shared, but is allowed to supply clean data Why? Has two schemes for doing snooping. –Home snoop –Source snoop

Home snoop (1/2)

Home snoop (2/2)

Source Snoop (1/2)

PCI Express Basic idea is to do the job of PCI using a scheme similar to QPI. –Though it preceded QPI and QPI borrowed a number of ideas from it. Faster I/O –Again point-to-point for speed –Complexity for speed Target devices need to be smarter.