
1 14:332:331 Computer Architecture and Assembly Language Fall 2003 Week 12 Buses and I/O system
[Adapted from Dave Patterson’s UCB CS152 slides and Mary Jane Irwin’s PSU CSE331 slides]

2 Head's Up
This week's material
Buses: connecting I/O devices – reading assignment PH 8.4
Memory hierarchies – reading assignment PH 7.1 and B.5
Reminders
Next week's material
Basics of caches – reading assignment PH 7.2

3 Review: Major Components of a Computer
[Diagram: Processor (Control + Datapath with Cache), Memory (Main Memory, Secondary Memory/Disk), and Input/Output Devices]

4 Input and Output Devices
I/O devices are incredibly diverse with respect to behavior, partner, and data rate:

Device            Behavior         Partner   Data rate (KB/sec)
Keyboard          input            human     0.01
Mouse             input            human     0.02
Laser printer     output           human     200.00
Graphics display  output           human     60,000.00
Network/LAN       input or output  machine   ?
Floppy disk       storage          machine   100.00
Magnetic disk     storage          machine   ?

Next, let's take a closer look at one of the most popular storage devices, the magnetic disk.

5 Magnetic Disk
Purpose
Long term, nonvolatile storage
Lowest level in the memory hierarchy: slow, large, inexpensive
General structure
A rotating platter coated with a magnetic surface
A moveable read/write head is used to access the disk
Advantages of hard disks over floppy disks
Platters are more rigid (metal or glass) so they can be larger
Higher density because the head can be controlled more precisely
Higher data rate because the disk spins faster
Can incorporate more than one platter
The purpose of the magnetic disk is to provide long term, non-volatile storage. Disks are large in terms of capacity and inexpensive, but slow, so they reside at the lowest level in the memory hierarchy. There are two types of disks: floppy and hard drives. Both types rely on a rotating platter coated with a magnetic surface, and a movable head is used to access the disk. The advantages of hard disks over floppy disks are: (a) platters are made of metal or glass, so they are more rigid and can be larger; (b) a hard disk has higher density because the head can be controlled more precisely; (c) a hard disk has a higher data rate because it spins faster; (d) finally, each hard disk drive can incorporate more than one platter.

6 Organization of a Magnetic Disk
[Diagram: stacked platters, each surface divided into tracks, each track divided into sectors]
Typical numbers (depending on the disk size)
1 to 15 platters (2 surfaces each) per disk, with 1" to 8" diameter
1,000 to 5,000 tracks per surface
63 to 256 sectors per track; the sector is the smallest unit that can be read/written (typically 512 to 1,024 B)
Traditionally all tracks have the same number of sectors
Newer disks with smart controllers can record more sectors on the outer tracks (constant bit density)
Here is a primitive picture showing how a disk drive can have multiple platters. Each surface of a platter is divided into tracks, and each track is further divided into sectors. A sector is the smallest unit that can be read or written. By simple geometry you know the outer tracks have more area, so you would think they would have more sectors. This, however, is not the case in traditional disk design, where all tracks have the same number of sectors. Well, you will say, this is dumb, but dumb is the reason they do it. By keeping the number of sectors the same, the disk controller hardware and software can be dumb and do not have to know which track has how many sectors. With more intelligent disk controller hardware and software, it is getting more popular to record more sectors on the outer tracks. This is referred to as constant bit density.
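To make these numbers concrete, here is a small capacity calculation in C. The values are illustrative picks from the ranges above, not a specific product.

    #include <stdio.h>

    int main(void) {
        /* Illustrative values picked from the ranges on this slide */
        int platters           = 4;
        int surfaces           = platters * 2;  /* both sides of each platter */
        int tracks_per_surface = 4000;
        int sectors_per_track  = 128;
        int bytes_per_sector   = 512;

        long long capacity = (long long)surfaces * tracks_per_surface
                           * sectors_per_track * bytes_per_sector;

        printf("Capacity = %lld bytes (~%.1f GB)\n",
               capacity, capacity / 1e9);   /* ~2.1 GB */
        return 0;
    }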

7 Magnetic Disk Characteristic
[Diagram: platter, head, sector, track, cylinder]
Cylinder: all the tracks under the heads at a given point on all surfaces
Read/write data is a three-stage process:
Seek time: position the arm over the proper track (6 to 14 ms avg.); due to locality of disk references, the actual average seek time may be only 25% to 33% of the advertised number
Rotational latency: wait for the desired sector to rotate under the read/write head (on average ½ revolution, i.e., ½ of 1/RPM)
Transfer time: transfer a block of bits (sector) under the read-write head (2 to 20 MB/sec typical)
Controller time: the overhead the disk controller imposes in performing a disk I/O access (typically < 2 ms)
To read or write the information in a sector, a movable arm containing a read/write head is located over each surface. The term cylinder refers to all the tracks under the read/write heads at a given point on all surfaces. To access data, the operating system must direct the disk through a three-stage process. (a) The first step is to position the arm over the proper track. This is the seek operation, and the time to complete it is called the seek time. (b) Once the head has reached the correct track, we must wait for the desired sector to rotate under the read/write head. This is referred to as the rotational latency. (c) Finally, once the desired sector is under the read/write head, the data transfer can begin. The average seek time as reported by the manufacturer is in the range of 12 ms to 20 ms and is calculated as the sum of the times for all possible seeks divided by the number of possible seeks. This number is usually on the pessimistic side: due to locality of disk references, the actual average seek time may be only 25% to 33% of the published number. As far as rotational latency is concerned, most disks rotate at 3,600 RPM, or approximately 16 ms per revolution. Since on average the information you want is halfway around the disk, the average rotational latency is 8 ms. The transfer time is a function of transfer size, rotation speed, and recording density; typical transfer speeds are 2 to 4 MB per second. Notice that the transfer time is much faster than the rotational latency and seek time. This is similar to the DRAM situation, where the DRAM access time is much shorter than the DRAM cycle time. Does anybody remember what we did to take advantage of the short access time versus cycle time? Well, we interleave!
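The four components add up to the total time for one disk access. A minimal sketch, assuming an illustrative drive (10,000 RPM, 6 ms advertised average seek, 512 B sectors, 20 MB/sec transfer, 2 ms controller overhead):

    #include <stdio.h>

    int main(void) {
        /* Illustrative values from the ranges on this slide */
        double seek_ms        = 6.0;
        double rpm            = 10000.0;
        double rot_latency_ms = 0.5 * (60.0 / rpm) * 1000.0; /* half a revolution */
        double sector_bytes   = 512.0;
        double xfer_rate      = 20e6;                        /* 20 MB/sec */
        double transfer_ms    = sector_bytes / xfer_rate * 1000.0;
        double controller_ms  = 2.0;

        printf("rotational latency = %.2f ms\n", rot_latency_ms); /* 3.00 ms   */
        printf("transfer time      = %.4f ms\n", transfer_ms);    /* 0.0256 ms */
        printf("avg access time    = %.2f ms\n",
               seek_ms + rot_latency_ms + transfer_ms + controller_ms);
        return 0;
    }

Note how the transfer time is negligible next to the seek and rotational components, which is exactly the point made in the notes above.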

8 Magnetic Disk Examples

Characteristic                 Sun X6713A   Toshiba MK2016
Disk diameter (inches)         3.5          2.5
Capacity                       73 GB        20 GB
MTTF (k hr's)                  1,200        300
# of platters - heads          ?            2 - 4
# of cylinders                 ?            16,383
# B/sector - # sectors/track   ?            ?
Rotation speed (RPM)           10,000       4,200
Max. - avg. seek time (ms)     ? - 6.6      ?
Avg. rotational latency (ms)   3            7.14
Transfer rate (PIO)            35 MB/sec    16.6 MB/sec
Power (watts)                  ?            < 2.5
Volume (in^3)                  ?            4.01
Weight (oz)                    ?            3.49

The MTTF is the mean time to failure, which is a common measure of reliability.
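The two average rotational latency rows can be checked directly from the rotation speeds, since on average the desired sector is half a revolution away:

    #include <stdio.h>

    /* Average rotational latency: half a revolution on average, so
       latency = 0.5 / (RPM / 60) seconds. */
    static double avg_rot_latency_ms(double rpm) {
        return 0.5 * (60.0 / rpm) * 1000.0;
    }

    int main(void) {
        printf("10,000 RPM: %.2f ms\n", avg_rot_latency_ms(10000.0)); /* 3.00 */
        printf(" 4,200 RPM: %.2f ms\n", avg_rot_latency_ms(4200.0));  /* 7.14 */
        return 0;
    }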

9 I/O System Interconnect Issues
[Diagram: processor, main memory, receiver, and keyboard connected by a bus]
A bus is a shared communication link (a set of wires used to connect multiple subsystems). Key interconnect issues:
Performance
Expandability
Resilience in the face of failure – fault tolerance

10 Performance Measures
Latency (execution time, response time) is the total time from the start to the finish of one instruction or action; usually used to measure processor performance
Throughput – the total amount of work done in a given amount of time (aka execution bandwidth), i.e., the number of operations performed per second
Bandwidth – the amount of information communicated across an interconnect (e.g., a bus) per unit time; the bit width of the operation * the rate of the operation; usually used to measure I/O performance
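As a worked instance of the bandwidth definition (bit width * rate), the sketch below computes the peak bandwidth of a hypothetical 32-bit bus that completes one transfer per 33 MHz clock; the numbers are illustrative, not taken from a particular standard.

    #include <stdio.h>

    int main(void) {
        /* Illustrative: a 32-bit-wide bus, one transfer per 33 MHz clock */
        double width_bytes = 32.0 / 8.0;    /* bit width in bytes */
        double rate_hz     = 33e6;          /* transfers per second */

        double bw = width_bytes * rate_hz;  /* bytes per second */
        printf("Peak bandwidth = %.0f MB/sec\n", bw / 1e6); /* 132 MB/sec */
        return 0;
    }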

11 I/O System Expandability
Usually there is more than one I/O device in the system; each I/O device is controlled by an I/O Controller
[Diagram: processor (with cache) and main memory on a memory-I/O bus; I/O controllers for a terminal, disks, and a network hang off the bus and raise interrupt signals to the processor]

12 Bus Characteristics
Control lines
Signal requests and acknowledgments
Indicate what type of information is on the data lines
Data lines
Carry data, complex commands, and addresses
A bus transaction consists of
Sending the address
Receiving (or sending) the data
A bus generally contains a set of control lines and a set of data lines. The control lines are used to signal requests and acknowledgments and to indicate what type of information is on the data lines. The data lines carry information between the source and the destination. This information may consist of data, addresses, or complex commands. A bus transaction includes two parts: (a) sending the address and (b) then receiving or sending the data.

13 Output (Read) Bus Transaction
Bus transactions are defined by what they do to memory: read = output, transferring data from memory (read) to the I/O device (write)
Step 1: The processor sends a read request and the read address to memory; the control lines signal the read request while the data lines carry the address
Step 2: Memory accesses the data
Step 3: Memory transfers the data to the disk using the data lines, and signals that the data is available to the I/O device using the control lines; the device stores the data as it appears on the bus
How does the device know where to store the data on the disk? We will see that this has to have been communicated in a prior transaction. Notice that the data lines of the bus can carry both an address (as in step 1) and data (as in step 3).

14 Input (Write) Bus Transaction
Defined by what they do to memory: write = input, transferring data from the I/O device (read) to memory (write)
Step 1: The processor sends a write request and the write address to memory; the control lines indicate the write request while the data lines carry the address
Step 2: When the memory is ready to write, it signals the I/O device, which then transfers the data; the memory stores the data as it receives it – the device need not wait for the store to be completed

15 Advantages and Disadvantages of Buses
Advantages
Versatility: new devices can be added easily, and peripherals can be moved between computer systems that use the same bus standard
Low cost: a single set of wires is shared in multiple ways
Disadvantages
It creates a communication bottleneck: the bus bandwidth limits the maximum I/O throughput
The maximum bus speed is largely limited by the length of the bus and the number of devices on the bus
It needs to support a range of devices with widely varying latencies and data transfer rates
The two major advantages of the bus organization are versatility and low cost. By versatility, we mean that new devices can easily be added. Furthermore, if a device is designed according to an industry bus standard, it can be moved between computer systems that use the same bus standard. The bus organization is a low cost solution because a single set of wires is shared in multiple ways. The major disadvantage of the bus organization is that it creates a communication bottleneck. When I/O must pass through a single bus, the bandwidth of that bus can limit the maximum I/O throughput. The maximum bus speed is also largely limited by (a) the length of the bus, (b) the number of I/O devices on the bus, and (c) the need to support a wide range of devices with widely varying latencies and data transfer rates.

16 Types of Buses
Processor-Memory Bus (proprietary)
Short and high speed
Matched to the memory system to maximize the memory-processor bandwidth
Optimized for cache block transfers
I/O Bus (industry standard, e.g., SCSI, USB, ISA, IDE)
Usually lengthy and slower
Needs to accommodate a wide range of I/O devices
Connects to the processor-memory bus or backplane bus
Backplane Bus (industry standard, e.g., PCI)
The backplane is an interconnection structure within the chassis
Used as an intermediary bus connecting I/O buses to the processor-memory bus
Buses are traditionally classified as one of three types: processor-memory buses, I/O buses, or backplane buses. The processor-memory bus is usually design specific, while the I/O and backplane buses are often standard buses. In general, processor-memory buses are short and high speed. They are matched to the memory system in order to maximize the memory-to-processor bandwidth and are connected directly to the processor. An I/O bus is usually lengthy and slow because it has to accommodate a wide range of I/O devices, and it usually connects to the processor-memory bus or backplane bus. The backplane bus receives its name because it was often built into the backplane of the computer – it is an interconnection structure within the chassis. It is designed to allow processors, memory, and I/O devices to coexist on a single bus, so it has the cost advantage of having only one single bus for all components.

17 A Two Bus System
[Diagram: processor and memory on a processor-memory bus; I/O buses attach through bus adaptors]
I/O buses tap into the processor-memory bus via bus adaptors (which do speed matching between buses)
Processor-memory bus: mainly for processor-memory traffic
I/O buses: provide expansion slots for I/O devices
Here is an example using two buses, where multiple I/O buses tap into the processor-memory bus via bus adaptors. The processor-memory bus is used mainly for processor-memory traffic, while the I/O buses are used to provide expansion slots for the I/O devices. Two common standards found in such systems are PCI and SCSI.

18 A Three Bus System
[Diagram: processor-memory bus with a bus adaptor to a backplane bus, which in turn connects to I/O buses through further bus adaptors]
A small number of backplane buses tap into the processor-memory bus
The processor-memory bus is used for processor-memory traffic
The I/O buses are connected to the backplane bus
Advantage: loading on the processor-memory bus is greatly reduced
Finally, in a three-bus system, a small number of backplane buses (in our example here, just one) tap into the processor-memory bus. The processor-memory bus is used mainly for processor-memory traffic, while the I/O buses are connected to the backplane bus via bus adaptors. An advantage of this organization is that the loading on the processor-memory bus is greatly reduced because of the small number of taps into the high-speed processor-memory bus.

19 I/O System Example (Apple Mac 7200)
Typical of a midrange to high-end desktop system in 1997
[Diagram: processor with cache on a processor-memory bus; a PCI interface/memory controller connects the processor-memory bus, main memory, and the PCI bus; PCI I/O controllers serve audio I/O, serial ports, a graphics terminal, a network, and a SCSI bus with a CD-ROM, disks, and tape]

20 Example: Pentium System Organization
[Diagram: processor-memory bus connected through the memory controller ("Northbridge") to the PCI bus, which in turn connects to the I/O buses]

21 A Bus Transaction
A bus transaction includes three parts:
Gaining access to the bus – arbitration
Issuing the command (and address) – request
Transferring the data – action
Gaining access to the bus: how is the bus reserved by a device that wishes to use it?
Chaos is avoided by a master-slave arrangement: the bus master initiates and controls all bus requests
In the simplest system the processor is the only bus master; the major drawback is that the processor must be involved in every bus transaction
[Diagram: master and slave connected by control and data lines; the master initiates requests, and data can go either way]
The bus master is the one who starts the bus transaction by sending out the address. The slave is the one who responds to the master, either sending data to the master if the master asks for data, or receiving data from the master if the master wants to send data. In most simple I/O operations the processor is the bus master, but as I will show you later in today's lecture, this is not always the case. Talking about trying to get onto the bus: how does a device get onto the bus anyway? If everybody tries to use the bus at the same time, chaos will result. Chaos is avoided by a master-slave arrangement where only the bus master is allowed to initiate and control bus requests. The slave has no control over the bus; it just responds to the master's requests. Pretty sad. In the simplest system, the processor is the one and ONLY bus master, and all bus requests must be controlled by the processor. The major drawback of this simple approach is that the processor needs to be involved in every bus transaction and can use up too many processor cycles.

22 Single Master Bus Transaction
All bus requests are controlled by the processor; it initiates the bus cycle on behalf of the requesting device
Step 1: The disk wants to use the bus, so it generates a bus request to the processor
Step 2: The processor responds and generates the appropriate control signals; e.g., if the device wants to perform output from memory to the disk, the processor asserts the read request lines to memory
Step 3: The processor notifies the device that its bus request is being processed, i.e., it gives the slave (disk) permission to use the bus; as a result, the device knows it can use the bus and places the address for the request on the bus
A major drawback is that the processor must be involved in every bus transaction. A single sector read from a disk may require the processor to get involved hundreds to thousands of times, depending on the size of the sector.

23 Multiple Potential Bus Masters: Arbitration
Bus arbitration scheme:
A bus master wanting to use the bus asserts the bus request
A bus master cannot use the bus until its request is granted
A bus master must release the bus after its use
Bus arbitration schemes usually try to balance two factors:
Bus priority – the highest priority device should be serviced first
Fairness – even the lowest priority device should never be completely locked out from using the bus
Bus arbitration schemes can be divided into four broad classes:
Daisy chain arbitration: all devices share one request line
Centralized, parallel arbitration: multiple request and grant lines
Distributed arbitration by self-selection: each device wanting the bus places a code indicating its identity on the bus
Distributed arbitration by collision detection: Ethernet uses this
A more aggressive approach is to allow multiple potential bus masters in the system. With multiple potential bus masters, a mechanism is needed to decide which master gets to use the bus next. This decision process is called bus arbitration, and this is how it works. A potential bus master (which can be a device or the processor) wanting to use the bus first asserts the bus request line, and it cannot start using the bus until the request is granted. Once it finishes using the bus, it must tell the arbiter that it is done so the arbiter can allow other potential bus masters to get onto the bus. All bus arbitration schemes try to balance two factors: bus priority and fairness. Priority is self explanatory. Fairness means that even the device with the lowest priority should never be completely locked out from the bus. In distributed arbitration by self-selection, (a) each device wanting the bus places a code indicating its identity on the bus, and (b) by examining the bus, each device can determine the highest priority device that has made a request and decide whether it can get on. In distributed arbitration by collision detection, each device independently requests the bus, and collisions result in garbage on the bus when multiple requests occur simultaneously. Each device detects whether its request resulted in a collision, and if it did, it backs off for a random period of time before trying again. The Ethernet you use for your workstation uses this scheme. We will talk about the second scheme, centralized parallel arbitration, only.

24 Centralized Parallel Arbitration
[Diagram: devices 1 through N each drive a request line (Req) and receive a dedicated grant line (Grant1 ... GrantN) from a central bus arbiter, which controls access to the control and data lines]
Used in essentially all backplane and high-speed I/O busses
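The sketch below illustrates the decision a centralized parallel arbiter makes each cycle: it sees one request line per device and raises exactly one grant line. This toy version rotates the priority after each grant so that even the lowest-priority device is never locked out (the fairness goal from the previous slide); a real arbiter is hardware, so treat this only as an illustration of the policy.

    #include <stdio.h>

    #define N_DEVICES 4

    /* One arbitration decision: given request lines req[i], grant the
       requesting device closest (in rotation order) to last_grant + 1.
       Rotating the starting point gives round-robin fairness. */
    static int arbitrate(const int req[N_DEVICES], int last_grant) {
        for (int i = 1; i <= N_DEVICES; i++) {
            int dev = (last_grant + i) % N_DEVICES;
            if (req[dev]) return dev;   /* raise Grant[dev] */
        }
        return -1;                      /* no requests this cycle */
    }

    int main(void) {
        int req[N_DEVICES] = {1, 0, 1, 1};  /* devices 0, 2, 3 want the bus */
        int last = -1;

        for (int cycle = 0; cycle < 4; cycle++) {
            int g = arbitrate(req, last);
            if (g < 0) break;
            printf("cycle %d: grant -> device %d\n", cycle, g);
            req[g] = 0;   /* device releases the bus after its transaction */
            last = g;
        }
        return 0;
    }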

25 Synchronous and Asynchronous Buses
Synchronous Bus
Includes a clock in the control lines
A fixed protocol for communication that is relative to the clock
Advantage: involves very little logic and can run very fast
Disadvantages:
Every device on the bus must run at the same clock rate
To avoid clock skew, buses cannot be long if they are fast
Asynchronous Bus
Not clocked, so it requires a handshaking protocol (req, ack), implemented with additional control lines
Advantages:
Can accommodate a wide range of devices
Can be lengthened without worrying about clock skew or synchronization problems
Disadvantage: slow(er)
There are substantial differences between the design requirements for the I/O buses and processor-memory buses and the backplane buses. Consequently, there are two different schemes for communication on the bus: synchronous and asynchronous. A synchronous bus includes a clock in the control lines and a fixed protocol for communication that is relative to the clock. Since the protocol is fixed and everything happens with respect to the clock, it involves very little logic and can run very fast. Most processor-memory buses fall into this category. Synchronous buses have two major disadvantages: (1) every device on the bus must run at the same clock rate, and (2) if they are fast, they must be short to avoid clock skew problems. By definition, an asynchronous bus is not clocked, so it can accommodate a wide range of devices at different clock rates and can be lengthened without worrying about clock skew. The drawback is that it can be slow and more complex, because a handshaking protocol is needed to coordinate the transmission of data between the sender and receiver.

26 Asynchronous Handshaking Protocol
Output (read) data from memory to an I/O device, using three control lines:
ReadReq: indicates a read request for memory; the address is put on the data lines at the same time
DataRdy: indicates that the data word is now ready on the data lines; in an output transaction the memory asserts it, in an input transaction the I/O device asserts it
Ack: acknowledges the ReadReq or DataRdy signal of the other party
The handshake proceeds as follows:
1. The I/O device signals a request by raising ReadReq and putting the addr on the data lines
2. Memory sees ReadReq, reads addr from the data lines, and raises Ack
3. The I/O device sees Ack and releases the ReadReq and data lines
4. Memory sees ReadReq go low and drops Ack
5. When memory has the data ready, it places it on the data lines and raises DataRdy
6. The I/O device sees DataRdy, reads the data from the data lines, and raises Ack
7. Memory sees Ack, releases the data lines, and drops DataRdy
8. The I/O device sees DataRdy go low and drops Ack
The control signals ReadReq and DataRdy are asserted until the other party indicates that the control lines have been seen and the data lines have been read; this indication is made by asserting the Ack line – HANDSHAKING.
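The following toy C program walks through the eight steps once, with shared variables standing in for the control and data lines. It is a single-threaded sketch of the protocol's sequencing, not of the real concurrency, and the address and data values are made up.

    #include <stdio.h>

    /* Shared "wires". Each line performs one step of the handshake that
       the two real parties (I/O device and memory) would run concurrently. */
    static int ReadReq, DataRdy, Ack;
    static unsigned data_lines;

    int main(void) {
        unsigned memory[256] = {0};
        unsigned addr = 0x40, latched = 0, device_buf = 0;
        memory[addr] = 0xBEEF;

        data_lines = addr; ReadReq = 1;            /* 1: device drives addr, raises ReadReq */
        if (ReadReq) { latched = data_lines; Ack = 1; } /* 2: memory latches addr, raises Ack */
        if (Ack) ReadReq = 0;                      /* 3: device releases ReadReq and data lines */
        if (!ReadReq) Ack = 0;                     /* 4: memory sees ReadReq low, drops Ack */
        data_lines = memory[latched]; DataRdy = 1; /* 5: memory drives data, raises DataRdy */
        if (DataRdy) { device_buf = data_lines; Ack = 1; } /* 6: device reads data, raises Ack */
        if (Ack) DataRdy = 0;                      /* 7: memory releases data lines, drops DataRdy */
        if (!DataRdy) Ack = 0;                     /* 8: device sees DataRdy low, drops Ack */

        printf("device read 0x%X from address 0x%X\n", device_buf, addr);
        return 0;
    }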

27 Key Characteristics of Two Bus Standards

Characteristic      PCI                  SCSI
Type                backplane            I/O
Data bus width      32 or 64             8 to 32
Addr/data muxed?    multiplexed          multiplexed
# of masters        multiple             multiple
Arbitration         centralized          self-selection
Clocking            synchronous (? MHz)  asynchronous
Peak bandwidth      ? MB/sec             5 MB/sec
Typical bandwidth   80 MB/sec            1.5 MB/sec
Max. devices        32 per bus segment   7 to 31
Max. length         0.5 meters           25 meters

28 Review: Major Components of a Computer
[Diagram: Processor (Control + Datapath), Memory, and Input/Output Devices]
That is, any computer, no matter how primitive or advanced, can be divided into five parts: (1) the input devices bring the data from the outside world into the computer; (2) these data are kept in the computer's memory until (3) the datapath requests and processes them; (4) the operation of the datapath is controlled by the computer's controller; and, since all the work done by the computer will NOT do us any good unless we can get the data back to the outside world, (5) getting the data back out is the job of the output devices. The most COMMON way to connect these five components together is to use a network of buses. Workstation design target: 25% of cost on the processor, 25% of cost on memory (minimum memory size), and the rest on I/O devices, power supplies, and the box.

29 A Typical Memory Hierarchy
By taking advantage of the principle of locality:
Present the user with as much memory as is available in the cheapest technology.
Provide access at the speed offered by the fastest technology.
[Diagram: on-chip components (control, datapath, register file, instruction and data caches, ITLB/DTLB, eDRAM) backed by a second-level cache (SRAM), main memory (DRAM), and secondary memory (disk); moving away from the processor, access time in ns grows, size grows from bytes through K's and M's to T's, and cost per byte falls from highest to lowest]
The memory system of a modern computer consists of a series of black boxes ranging from the fastest to the slowest. Besides variation in speed, these boxes also vary in size (smallest to biggest) and cost. What makes this kind of arrangement work is one of the most important principles in computer design: the principle of locality. The design goal is to present the user with as much memory as is available in the cheapest technology (the disk), while, by taking advantage of the principle of locality, providing an average access speed that is very close to the speed offered by the fastest technology. (We will go over this slide in detail in the next lectures on caches.)

30 Characteristics of the Memory Hierarchy
[Diagram: processor at the top, then L1$, L2$, Main Memory, and Secondary Memory, with access time and (relative) size increasing with distance from the processor; transfer units grow downward – 4-8 bytes (word) between processor and L1$, 8-32 bytes (block) between L1$ and L2$, 1 block between L2$ and main memory, and 1,023+ bytes (disk sector = page) between main memory and secondary memory]
Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM

31 Memory Hierarchy Technologies
Random Access
"Random" is good: access time is the same for all locations
DRAM: Dynamic Random Access Memory
High density (1-transistor cells), low power, cheap, slow
Dynamic: needs to be "refreshed" regularly (~ every 8 ms)
SRAM: Static Random Access Memory
Low density (6-transistor cells), high power, expensive, fast
Static: content will last "forever" (until power is turned off)
Size ratio, DRAM/SRAM: 4 to 8
Cost/cycle time ratio, SRAM/DRAM: 8 to 16
"Not-so-random" Access Technology
Access time varies from location to location and from time to time (e.g., disk, CDROM)
The technology we use to build our memory hierarchy can be divided into two categories: random access and not-so-random access. Unlike all other aspects of life, where the word random usually associates with bad things, random, when associated with memory access, for lack of a better word, is good! Random access means you can access any random location at any time, and the access time will be the same as for any other random location. This is NOT the case for disks or tape, where the access time for a given location at any time can be quite different from that of some other random location at some other random time. As far as random access technology is concerned, we will concentrate on two specific technologies: dynamic RAM and static RAM. The advantages of dynamic RAM are high density, low cost, and low power, so we can have a lot of it without burning a hole in our budget or our desktop. The disadvantage of DRAM is that it is slow; also, it will forget what you tell it if you don't remind it constantly (refresh). SRAM only has one redeeming feature: it is fast. Other than that, it has low density, is expensive, and burns a lot of power. Oh, SRAM actually has another redeeming feature: it will not forget what you tell it. It will keep whatever you write to it forever. Well, "forever" is a long time, so let's just say it will keep your data as long as you don't pull the plug on your computer. In the next two lectures, we will be focusing on DRAMs and SRAMs.

32 Classical SRAM Organization (~Square)
[Diagram: a square RAM cell array with word (row) select lines driven by a row decoder and bit (data) lines running to a column selector and I/O circuits; each intersection is a 6-T SRAM cell; the row address feeds the row decoder, the column address feeds the column selector, and the selected data word is driven out]
One memory row holds a block of data, so the column address selects the requested word from that block.
Putting multiple words in one memory row splits the decoder into two decoders (row and column) and makes the memory core square, reducing the length of the bit lines (but increasing the length of the word lines). The lsb part of the address goes into the column decoder (e.g., 6 bits, so that 64 words are assigned to one row, which with 32 bits per word gives 2**11 bit line pairs), leaving 14 bits for the row decoder (giving 2**14 word lines) for a not-quite-square array. This scheme is good only for up to 64 Kb to 256 Kb; for bigger memories it is too SLOW because the word and bit lines are too long. SRAM reads an entire row out at a time, and the column selector then picks out the requested word. Each row control line is referred to as a word line, and each vertical data line is referred to as a bit line.

33 Classical DRAM Organization (~Square Planes)
[Diagram: several square RAM cell array planes, each with a row decoder driving word (row) select lines and bit (data) lines running to a column selector and I/O circuits; each intersection is a 1-T DRAM cell; the column address selects the requested bit from the row in each plane, and the planes together supply the data bits of one data word]
Similar to SRAM, DRAM is organized into rows and columns. But unlike SRAM, which reads out an entire word at a time, classical DRAM only reads out one bit at a time, so we need several planes of cells to store one word. The reason for this is to save power as well as area. Remember that the DRAM cell is very small, so we have a lot of them across horizontally. It would be very difficult to build a sense amplifier for each column due to the area constraint, not to mention that having a sense amplifier per column would consume a lot of power. You select the bit you want to read or write by supplying a row and then a column address. As in SRAM, each row control line is referred to as a word line and each vertical data line is referred to as a bit line.

34 RAM Memory Definitions
Caches use SRAM for speed; main memory is DRAM for density
Addresses are divided into 2 halves (row and column)
RAS or Row Access Strobe: triggers the row decoder
CAS or Column Access Strobe: triggers the column selector
Performance of main memory DRAMs
Latency: time to access one word
Access time: time between the request and when the word arrives
Cycle time: time between requests; usually cycle time > access time
Bandwidth: how much data can be supplied per unit time = width of the data channel * the rate at which it can be used

35 Classical DRAM Operation
DRAM organization: N rows x N columns x M bits
Read or write M bits at a time
Each M-bit access requires a full RAS / CAS cycle: 1) assert RAS, 2) latch the row address, 3) assert CAS, 4) latch the column address, 5) transfer the data
[Timing diagram: for a 1st and a 2nd M-bit access, RAS is asserted with the row address, then CAS with the column address, and the full cycle time is paid again for each access; the M-bit output appears once per cycle]
In classical DRAM, we can only read and write M bits at a time because only one row and one column are selected at any time by the row and column addresses. In other words, for each M-bit memory access, we have to provide a row address followed by a column address. Very time consuming. So the engineers got smart and said: "Wait a minute, this is silly; why don't we put an N x M register here so we can save an entire row internally whenever we access a row?" That idea, fast page mode operation, is another performance booster for DRAM (slide 39).

36 Ways to Improve DRAM Performance
Memory interleaving
Fast Page Mode DRAMs – FPM DRAMs
Extended Data Out DRAMs – EDO DRAMs
Synchronous DRAMs – SDRAMs
Rambus DRAMs
Double Data Rate DRAMs – DDR DRAMs
. . .

37 Increasing Bandwidth – Interleaving
Access pattern without interleaving: the CPU starts the access for D1, waits out the full memory cycle time until D1 is available, and only then starts the access for D2.
Access pattern with 4-way interleaving (banks 0 through 3): access bank 0, then bank 1, bank 2, and bank 3, each starting one clock behind the last; by the time bank 3 has been started, we can access bank 0 again.
Without interleaving, the frequency of our accesses is limited by the DRAM cycle time. With interleaving, that is, having multiple banks of memory, we can access the memory much more frequently by accessing another bank while the last bank is finishing up its cycle. For example, first we access memory bank 0. Once we get the data from bank 0, we access bank 1 while bank 0 is still finishing up the rest of its DRAM cycle. Ideally, with interleaving, how quickly we can perform memory accesses is limited by the memory access time only. Memory interleaving is one common technique to improve memory performance.
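The sketch below puts illustrative numbers on this picture: with a one-clock transfer and a four-clock bank cycle time, four-way interleaving lets sequential words complete one per clock once the banks are up and running. The timings are invented for the example.

    #include <stdio.h>

    int main(void) {
        /* Illustrative timings, in bus clocks */
        int access = 1;   /* clocks to transfer one word */
        int cycle  = 4;   /* clocks before the same bank can start again */
        int words  = 16;  /* sequential word accesses */

        /* No interleaving: every access waits out the full cycle time */
        int t_flat = words * cycle;

        /* 4-way interleaving: banks overlap, so after the first word,
           one word completes per clock (assuming banks >= cycle) */
        int t_inter = cycle + (words - 1) * access;

        printf("no interleaving : %d clocks\n", t_flat);  /* 64 */
        printf("4-way interleave: %d clocks\n", t_inter); /* 19 */
        return 0;
    }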

38 Problems with Interleaving
How many banks?
Ideally, the number of banks >= the number of clocks we have to wait to access the next word in the same bank
Only works for sequential accesses (i.e., first word requested in the first bank, second word requested in the second bank, etc.)
Increasing DRAM sizes => fewer chips => harder to have banks
Growth in bits/chip for DRAM: 50%-60%/yr
Can only be used for very large memory systems (e.g., those encountered in supercomputer systems)

39 Fast Page Mode DRAM Operation
Fast page mode DRAM adds an N x M "SRAM" register to save a row
After a row is read into the SRAM "register":
Only CAS is needed to access other M-bit blocks on that row
RAS remains asserted while CAS is toggled
[Timing diagram: RAS is asserted once with the row address; CAS is then toggled with a new column address for the 1st, 2nd, 3rd, and 4th M-bit accesses]
With this register in place, all we need to do is assert RAS to latch in the row address; then the entire row is read out and saved into the register. After that, you only need to provide the column address and assert CAS to access other M-bit blocks within this same row. This type of operation, where RAS remains asserted while CAS is toggled to bring in a new column address, is called page mode operation.
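A back-of-the-envelope comparison shows why page mode helps: once the row is latched, each further access on that row pays only the CAS time. The timings below are illustrative, not from a datasheet.

    #include <stdio.h>

    int main(void) {
        /* Illustrative DRAM timings in ns (not from a datasheet) */
        int t_ras    = 30;  /* row access: read the row into the "SRAM" register */
        int t_cas    = 15;  /* column access within the latched row */
        int accesses = 4;   /* four M-bit accesses that hit the same row */

        /* Classical DRAM: a full RAS + CAS cycle for every access */
        int classical = accesses * (t_ras + t_cas);

        /* Fast page mode: one RAS, then only CAS per access on that row */
        int fpm = t_ras + accesses * t_cas;

        printf("classical: %d ns\n", classical); /* 180 ns */
        printf("page mode: %d ns\n", fpm);       /*  90 ns */
        return 0;
    }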

40 Why Care About the Memory Hierarchy?
The Processor-DRAM memory gap:
[Figure: performance (log scale, 1 to 1000) vs. time (1980 to 2000); processor performance ("Moore's Law") grows at 60%/year (2X/1.5yr), DRAM at 9%/year (2X/10yrs), so the processor-memory performance gap grows 50%/year]
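Compounding the two growth rates from the figure shows where the "50% per year" gap comes from: the ratio of the rates is 1.60/1.09, roughly 1.47. A quick check:

    #include <stdio.h>

    int main(void) {
        /* Growth rates from this slide: processor ~60%/yr, DRAM ~9%/yr */
        double cpu = 1.0, dram = 1.0;
        for (int year = 1980; year < 2000; year++) {
            cpu  *= 1.60;
            dram *= 1.09;
        }
        printf("gap growth per year: %.0f%%\n", (1.60 / 1.09 - 1.0) * 100.0);
        printf("gap after 20 years : %.0fx\n", cpu / dram);
        return 0;
    }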

41 Memory Hierarchy: Goals
Fact: large memories are slow; fast memories are small
How do we create a memory that gives the illusion of being large, cheap, and fast (most of the time)? By taking advantage of the Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
[Figure: probability of reference vs. address (0 to 2^n - 1), concentrated in a few small regions of the address space]

42 Memory Hierarchy: Why Does it Work?
Temporal locality (locality in time) => keep the most recently accessed data items closer to the processor
Spatial locality (locality in space) => move blocks consisting of contiguous words to the upper levels
[Diagram: blocks X and Y moving between the lower level memory, the upper level memory, and the processor]
How does the memory hierarchy work? It is rather simple, at least in principle. To take advantage of temporal locality, that is, locality in time, the memory hierarchy keeps more recently accessed data items closer to the processor, because chances are the processor will access them again soon. To take advantage of spatial locality, not ONLY do we move the item that has just been accessed to the upper level, but we ALSO move the data items that are adjacent to it.
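A generic code fragment (not from the slides) makes the two kinds of locality concrete: the accumulator is reused on every iteration (temporal locality), and the array is walked at consecutive addresses (spatial locality), which is why moving whole blocks of contiguous words pays off.

    #include <stdio.h>

    int main(void) {
        int a[1024];
        for (int i = 0; i < 1024; i++) a[i] = i;

        long sum = 0;
        for (int i = 0; i < 1024; i++) {
            /* Temporal locality: sum (and i) are touched every iteration.  */
            /* Spatial locality: a[i] and a[i+1] sit in the same block, so  */
            /* one block fetch services several subsequent references.      */
            sum += a[i];
        }
        printf("sum = %ld\n", sum);
        return 0;
    }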

43 Memory Hierarchy: Terminology
Hit: the data appears in some block in the upper level (Block X)
Hit rate: the fraction of memory accesses found in the upper level
Hit time: the time to access the upper level, which consists of the RAM access time + the time to determine hit/miss
Miss: the data needs to be retrieved from a block in the lower level (Block Y)
Miss rate = 1 - (hit rate)
Miss penalty: the time to replace a block in the upper level + the time to deliver the block to the processor
Hit time << miss penalty
A HIT is when the data the processor wants to access is found in the upper level (Blk X). The fraction of memory accesses that are hits is defined as the hit rate. The hit time is the time to access the upper level where the data is found; it consists of (a) the time to access this level and (b) the time to determine whether the access is a hit or a miss. If the data the processor wants cannot be found in the upper level, then we have a miss, and we need to retrieve the data (Blk Y) from the lower level. By definition, the miss rate is just 1 minus the hit rate. The miss penalty consists of two parts: (a) the time it takes to replace a block (Blk Y over Blk X) in the upper level, and (b) the time it takes to deliver the new block to the processor. It is very important that your hit time be much smaller than your miss penalty; otherwise, there will be no reason to build a memory hierarchy.
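These definitions combine into the standard average memory access time formula, AMAT = hit time + miss rate * miss penalty. The sketch below evaluates it for illustrative values (not from the slide) and shows why hit time must be much smaller than miss penalty.

    #include <stdio.h>

    int main(void) {
        /* Illustrative values, not from the slide */
        double hit_time     = 1.0;    /* cycles to access the upper level */
        double miss_penalty = 100.0;  /* cycles to fetch a block from below */
        double hit_rate     = 0.97;
        double miss_rate    = 1.0 - hit_rate;

        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.1f cycles\n", amat);  /* 1 + 0.03 * 100 = 4.0 */
        return 0;
    }

Even a 3% miss rate quadruples the effective access time here, which is why the hierarchy only wins when hits dominate.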

44 How is the Hierarchy Managed?
registers <-> memory: by the compiler (programmer?)
cache <-> main memory: by the hardware
main memory <-> disks: by the hardware and operating system (virtual memory), and by the programmer (files)

45 Summary
DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system
SRAM is fast but expensive and not very dense: a good choice for providing the user with a FAST access time
Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon
By taking advantage of the principle of locality:
Present the user with as much memory as is available in the cheapest technology
Provide access at the speed offered by the fastest technology
Let's summarize today's lecture. The first thing we covered is the principle of locality. There are two types of locality: temporal, or locality in time, and spatial, or locality in space. We also talked about memory system design. The key idea of memory system design is to present the user with as much memory as possible in the cheapest technology while, by taking advantage of the principle of locality, creating an illusion that the average access time is close to that of the fastest technology. As far as random access technology is concerned, we concentrated on two: DRAM and SRAM. DRAM is slow but cheap and dense, so it is a good choice for presenting the user with a BIG memory system. SRAM, on the other hand, is fast but also expensive, both in terms of cost and power, so it is a good choice for providing the user with a fast access time.

