1
CSE 243: Introduction to Computer Architecture and Hardware/Software Interface
2
General information
CSE 243: Introduction to Computer Architecture and Hardware/Software Interface.
Instructor: Swapna S. Gokhale
Phone :
Office: UTEB 468
Lecture time: TuTh 2:00 pm – 3:15 pm.
Office hours: By appointment. (I will hang around for a few minutes at the end of each class.)
Web page : (Lecture notes, homeworks, homework solutions, etc. will be posted on the web page.)
TA: Tobias Wolfertshofer
Office hours: Discuss with the TA.
3
Course Objective Describe the general organization and architecture of computers. Identify computers’ major components and study their functions. Introduce hardware design issues of modern computer architectures. Build the required skills to read and research the current literature in computer architecture. Learn assembly language programming.
4
Textbooks “Computer Organization,” by Carl Hamacher, Zvonko Vranesic and Safwat Zaky. Fifth Edition McGraw-Hill, 2002. “SPARC Architecture, Assembly Language Programming and C,” Richard P. Paul, Prentice Hall, 2000.
5
Course topics
Introduction (Chapter 1): basic concepts, overall organization.
Addressing methods (Chapter 2): fetch/execute cycle, basic addressing modes, instruction sequencing, assembly language and stacks; CISC vs. RISC architectures.
Examples of ISAs (Chapter 3): instruction set architecture and ARM instruction set architecture.
CPU architecture (Chapter 7): single-bus CPU, multiple-bus CPU, hardwired control, and microprogrammed control.
Arithmetic (Chapter 6): integer arithmetic and floating-point arithmetic.
Memory architecture (Chapter 5): memory hierarchy, primary memory, cache memory, virtual memory.
Input/Output organization (Chapter 4): I/O device addressing, I/O data transfers, synchronization, DMA, interrupts, channels, bus transfers, and interfacing.
I/O devices (Chapter 5): disk systems.
6
Grading System
Exam #1 (8%) - Addressing methods, CISC and RISC architectures.
Exam #2 (8%) - CPU architecture.
Exam #3 (8%) - Arithmetic.
Exam #4 (8%) - Memory architecture.
Final (28%) - All topics.
Homework assignments (15%) - 4 homework assignments.
Lab Assignments/Projects (25%)
7
Course topics, exams and assignment calendar
Week #1 (Jan 27):
- Addressing methods.
Week #2 (Feb 3):
- Instruction Set Architectures of and ARM processors.
- Assignment #1 handed out.
Week #3 (Feb 10):
- Problem solving -- assembly language programming.
- CPU architecture.
Week #4 (Feb 17):
- Assignment #1 due, Assignment #2 handed out.
- Exam #1 (Addressing modes, etc.).
Week #5 (Feb 24)
Week #6 (March 3):
- Arithmetic.
- Assignment #2 due, Assignment #3 handed out.
- Exam #2 (CPU architecture).
8
Course topics, exams and assignment calendar
Week #7 (March 10):
- Arithmetic.
Week #8 (March 24):
- Memory architecture.
- Assignment #3 due.
- Exam #3 (Arithmetic).
Week #9 (March 31):
- Assignment #4 handed out.
Week #10 (April 7):
- Memory organization.
- I/O organization.
Week #11 (April 14):
- Assignment #4 due.
- Exam #4 (Memory architecture).
9
Course topics, exams and assignment calendar
Week #12 (April 21):
- I/O organization.
- I/O devices.
Week #13 (April 28):
- Disk systems.
- Pipelining.
Week #14 (May 5)
10
Grading policy
Refer to the University policy regarding student conduct (plagiarism, etc.).
Grading of assignments/exams is handled by the TA; if you cannot resolve a problem with the TA, see me.
Assignments may be submitted by email. Hard copy will also be accepted, but you have to submit it in the department office to have the date stamped. Please submit all assignments to the TA.
Late assignments are penalized by a loss of 33% per day late (so 3 days late is as late as you can get). Solutions will be posted on the course web page. No assignments will be accepted after the solutions are posted.
The weeks during which exams will be held have been announced. The actual day of that week (Tuesday or Thursday) on which the exam will be held will be announced two weeks prior to the exam, and will also be posted on the course web page at the same time. If you have a conflict with an exam date, please see me in advance.
11
Important prerequisite material
CSE 207/208 is fundamental to CSE 243. Review issues in:
- Basic computer organization: CPU, memory, I/O, registers.
- Fundamentals of combinational design and sequential design.
- Simple ALU and simple register design.
12
Reading Reading the text is imperative.
Computer architecture, especially processor design, changes rapidly. You really have to keep up with the changes in the industry. This is especially important for job interviews later.
13
Labs
Lab times (as per the catalog) are not fixed. Work can be done outside of scheduled lab hours, but expect to spend at least the time allotted in the lab. Lab assignments will involve tkisem, a tool from the University of New Mexico. Information can be found at: tkisem is a Tcl/Tk version of isem (the Instructional Sparc Emulator). tkisem involves assembly language programming. If you have a PC (or a Mac running Virtual PC), you can install tkisem on your own machine. Otherwise, tkisem will be available in the learning center. Lab assignments should be submitted electronically to the TA.
14
Feedback Please provide informal feedback early and often, before the formal review process.
15
What is a computer? Simply put, a computer is a sophisticated electronic calculating machine that: Accepts input information, Processes the information according to a list of internally stored instructions and Produces the resulting output information. Functions performed by a computer are: Accepting information to be processed as input. Storing a list of instructions to process the information. Processing the information according to the list of instructions. Providing the results of the processing as output. What are the functional units of a computer?
16
Functional units of a computer
Input unit: accepts information from human operators, electromechanical devices, or other computers.
Memory: stores information (instructions and data).
Arithmetic and logic unit (ALU): performs the desired operations on the input information, as determined by instructions in the memory.
Control unit: coordinates the various actions (input, output, processing).
Output unit: sends results of processing to a monitor display, a printer, etc.
[Figure: Processor (Control, Arithmetic & Logic) connected to Memory (Instr1, Instr2, Instr3, Data1, Data2) and to the Input and Output (I/O) units.]
17
Information in a computer -- Instructions
Instructions specify commands to: Transfer information within a computer (e.g., from memory to ALU) Transfer of information between the computer and I/O devices (e.g., from keyboard to computer, or computer to printer) Perform arithmetic and logic operations (e.g., Add two numbers, Perform a logical AND). A sequence of instructions to perform a task is called a program, which is stored in the memory. Processor fetches instructions that make up a program from the memory and performs the operations stated in those instructions. What do the instructions operate upon?
18
Information in a computer -- Data
Data are the "operands" upon which instructions operate. Data could be numbers or encoded characters. Data, in a broad sense, means any digital information. Computers use data encoded as strings of binary digits called bits.
19
Input unit
Binary information must be presented to a computer in a specific format. This task is performed by the input unit, which:
- Interfaces with input devices.
- Accepts binary information from the input devices.
- Presents this binary information in a format expected by the computer.
- Transfers this information to the memory or processor.
[Figure: real-world input devices (keyboard, audio input, ...) feeding the Input Unit, which connects to the computer's memory and processor.]
20
Memory unit Memory unit stores instructions and data.
Recall that data is represented as a series of bits; to store data, the memory unit thus stores bits. The processor reads instructions and reads/writes data from/to the memory during the execution of a program. In theory, instructions and data could be fetched one bit at a time. In practice, a group of bits is fetched at a time; the group of bits stored or retrieved at a time is termed a "word". The number of bits in a word is termed the "word length" of a computer. In order to read from and write to memory, a processor must know where to look: an "address" is associated with each word location.
21
Memory unit (contd..)
The processor reads/writes to/from memory based on the memory address: it can access any word location in a short and fixed amount of time based on the address. Random Access Memory (RAM) provides a fixed access time independent of the location of the word; this access time is known as the "memory access time". Memory and processor have to "communicate" with each other in order to read/write information. In order to reduce this communication time, a small amount of RAM (known as cache) is tightly coupled with the processor. Modern computers have three to four levels of RAM units with different speeds and sizes: the fastest and smallest is known as cache, the slowest and largest as main memory.
Since the instructions and data need to be fetched from the memory in order to perform a task, the time it takes to access and fetch this information is one factor influencing how fast a given task completes. One way to increase the speed of performing a task is to reduce the amount of time it takes to fetch the data and the instructions; this time is called the "access time". Suppose we want to fetch the data at the memory location with address 10. In the case of sequential access, we have to pass over locations 1-9 before accessing location 10; clearly, with sequential access, the access time grows for locations further along the sequence. We need some kind of memory which provides a fixed, short access time irrespective of the memory location being accessed; that is, it provides random access. Why is the access time faster for the cache than it is for primary storage? I haven't yet discussed how the various units communicate with each other; in a few minutes I will discuss that, and it will become clear.
22
Memory unit (contd..)
Primary storage of the computer consists of RAM units: the fastest, smallest unit is cache; the slowest, largest unit is main memory. Primary storage is insufficient to store large amounts of data and programs. Primary storage can be added, but it is expensive, so large amounts of data are stored on secondary storage devices: magnetic disks and tapes, optical disks (CD-ROMs). Access to data stored in secondary storage is slower, but secondary storage takes advantage of the fact that some information may be accessed infrequently. The cost of a memory unit depends on its access time; a shorter access time implies a higher cost.
23
Arithmetic and logic unit (ALU)
Operations are executed in the Arithmetic and Logic Unit (ALU): arithmetic operations such as addition and subtraction, and logic operations such as comparison of numbers. In order to execute an instruction, operands need to be brought into the ALU from the memory. Operands are stored in general purpose registers available in the ALU; the access times of general purpose registers are shorter than that of the cache. Results of the operations are stored back in the memory or retained in the processor for immediate use.
24
Output unit
Computers represent information in a specific binary form. Output units:
- Interface with output devices.
- Accept processed results provided by the computer in the specific binary form.
- Convert the information from binary form to a form understood by an output device.
[Figure: the computer's memory and processor feeding the Output Unit, which drives real-world devices (printer, graphics display, speakers, ...).]
25
Control unit Operation of a computer can be summarized as:
Accepts information from the input units (Input unit). Stores the information (Memory). Processes the information (ALU). Provides processed results through the output units (Output unit). The operations of the input unit, memory, ALU and output unit are coordinated by the control unit. Instructions control "what" operations take place (e.g. data transfer, processing); the control unit generates timing signals which determine "when" a particular operation takes place.
26
How are the functional units connected?
For a computer to achieve its operation, the functional units need to communicate with each other; in order to communicate, they need to be connected. Functional units may be connected by a group of parallel wires, called a bus. Each wire in a bus can transfer one bit of information. The number of parallel wires in a bus is equal to the word length of the computer. What is a word? What is a word length? During the discussion of which functional unit did we come across this concept?
[Figure: Input, Output, Memory and Processor connected by a bus.]
27
Organization of cache and main memory
[Figure: Processor, cache memory and main memory connected by a bus.]
Why is the access time of the cache memory less than the access time of the main memory?
28
CSE 243: Introduction to Computer Architecture and Hardware/Software Interface
29
Execution of an instruction
Recall the steps involved in the execution of an instruction by a processor: Fetch an instruction from the memory. Fetch the operands. Execute the instruction. Store the results. Several issues: Where is the address of the memory location from which the present instruction is to be fetched? Where is the present instruction stored while it is executed? Where and what is the address of the memory location from which the data is fetched? ...... Basic processor architecture has several registers to assist in the execution of the instructions.
30
Basic processor architecture
[Figure: processor internals connected to memory.
- MAR: address of the memory location to be accessed.
- MDR: data to be read into or read out of the current location.
- PC: address of the next instruction to be fetched and executed.
- IR: instruction that is currently being executed.
- R0 ... R(n-1): n general purpose registers.
- Control unit and ALU.]
31
Basic processor architecture (contd..)
[Figure: processor split into a control path and a data path, with the MAR and MDR connecting to memory.]
The control path is responsible for:
- Instruction fetch and execution sequencing.
- Operand fetch.
- Saving results.
The data path:
- Contains the general purpose registers.
- Contains the ALU.
- Executes instructions.
32
Registers in the control path
Instruction Register (IR): Instruction that is currently being executed. Program Counter (PC): Address of the next instruction to be fetched and executed. Memory Address Register (MAR): Address of the memory location to be accessed. Memory Data Register (MDR): Data to be read into or read out of the current memory location, whose address is in the Memory Address Register (MAR).
33
Fetch/Execute cycle Execution of an instruction takes place in two phases: Instruction fetch. Instruction execute. Instruction fetch: Fetch the instruction from the memory location whose address is in the Program Counter (PC). Place the instruction in the Instruction Register (IR). Instruction execute: Instruction in the IR is examined (decoded) to determine which operation is to be performed. Fetch the operands from the memory or registers. Execute the operation. Store the results in the destination location. Basic fetch/execute cycle repeats indefinitely.
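The fetch/execute cycle above can be mimicked by a toy interpreter. This is only a sketch: the tuple-based instruction encoding and the mnemonics MOVE/ADDI/HALT are invented for illustration, not the textbook's instruction set.

```python
# Toy fetch/execute loop. Instructions are hypothetical tuples, e.g.
# ("ADDI", 5, "R0") adds the immediate 5 to register R0.
def run(program, registers):
    pc = 0                       # Program Counter: address of next instruction
    while pc < len(program):
        ir = program[pc]         # fetch into the Instruction Register
        pc += 1                  # PC now points to the next instruction
        op, *args = ir           # decode
        if op == "MOVE":         # execute: copy one register to another
            src, dst = args
            registers[dst] = registers[src]
        elif op == "ADDI":       # execute: add an immediate to a register
            imm, dst = args
            registers[dst] += imm
        elif op == "HALT":
            break
    return registers

regs = run([("ADDI", 5, "R0"), ("MOVE", "R0", "R1"), ("ADDI", 2, "R1"), ("HALT",)],
           {"R0": 0, "R1": 0})
# regs["R0"] == 5 and regs["R1"] == 7
```

The loop body makes the two phases explicit: the fetch (and PC update) is identical for every instruction, while the execute step depends on the decoded opcode.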
34
Memory organization Recall:
Information is stored in the memory as a collection of bits. A collection of bits that is stored or retrieved simultaneously is called a word, and the number of bits in a word is called the word length. Word length can be 16 to 64 bits. Another collection is more basic than a word: a collection of 8 bits is known as a "byte". Bytes are grouped into words, so word length can also be expressed as a number of bytes instead of a number of bits: a word length of 16 bits is equivalent to a word length of 2 bytes. Words may be 2 bytes (older architectures), 4 bytes (current architectures), or 8+ bytes (modern architectures).
35
Memory organization (contd..)
Accessing the memory to obtain information requires specifying the "address" of the memory location. Recall that a memory is a sequence of bits; assigning an address to each bit is impractical and unnecessary. Typically, addresses are assigned to single bytes: "byte addressable memory". Suppose k bits are used to hold the address of a memory location: the size of the memory in bytes is then 2^k. E.g., for a 16-bit address, the size of the memory is 2^16 = 65,536 bytes. What is the size of the memory for a 24-bit address?
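The relationship between address width and memory size can be checked directly (the function name is just for illustration):

```python
# Size of a byte-addressable memory given k address bits: 2**k bytes.
def memory_size_bytes(k):
    return 2 ** k

assert memory_size_bytes(16) == 65536       # 16-bit address -> 64 KB
assert memory_size_bytes(24) == 16777216    # 24-bit address -> 16 MB
```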
36
Memory organization (contd..)
Memory is viewed as a sequence of bytes, from Byte 0 through Byte 2^k - 1. The address of the first byte is 0; the address of the last byte is 2^k - 1, where k is the number of bits used to hold a memory address. E.g., when k = 16, the address of the last byte is 65535. E.g., when k = 2, what is the address of the first byte? What is the address of the last byte?
37
Memory organization (contd..)
Consider a memory organization with 16-bit memory addresses. The size of the memory is ? The word length is 4 bytes. Number of words = memory size (bytes) / word length (bytes) = ? Word #0 starts at Byte #0. Word #1 starts at Byte #4. The last word (Word #?) starts at Byte #?
[Figure: Word #0 occupying Bytes 0-3, Word #1 occupying Bytes 4-7, ..., the last word occupying Bytes 65532-65535.]
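The word/byte arithmetic for this organization can be worked out as a short sketch (variable and function names are illustrative):

```python
# Byte-addressable memory with 16-bit addresses and 4-byte words.
k, word_len = 16, 4
mem_size = 2 ** k                   # 65536 bytes total
num_words = mem_size // word_len    # 16384 words

def word_start(n):
    """Byte address at which word #n begins."""
    return n * word_len

# word_start(0) == 0, word_start(1) == 4,
# and the last word, #16383, starts at byte 65532.
```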
38
Memory organization (contd..)
[Figure: memory as 4-byte words (Word #0 = Bytes 0-3, Word #1 = Bytes 4-7, ..., Word #16383 = Bytes 65532-65535), with the MAR holding an address (e.g. 65532) and the MDR holding the data.]
The MAR register contains the address of the memory location being addressed. The MDR contains either the data to be written to that address or the data read from that address.
39
Memory operations
Memory read (or load):
- Place the address of the memory location to be read from into the MAR.
- Issue a Memory_read command to the memory.
- Data read from the memory is placed into the MDR automatically (by control logic).
Memory write (or store):
- Place the address of the memory location to be written to into the MAR.
- Place the data to be written into the MDR.
- Issue a Memory_write command to the memory.
- Data in the MDR is written to the memory automatically (by control logic).
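The read and write sequences above can be sketched as follows. The MAR/MDR names follow the slides; the `Memory` class and its dict-based storage are hypothetical illustration, not a real memory interface.

```python
# Sketch of memory read/write going through MAR and MDR.
class Memory:
    def __init__(self):
        self.cells = {}     # address -> stored data
        self.mar = 0        # Memory Address Register
        self.mdr = 0        # Memory Data Register

    def read(self, address):
        self.mar = address                       # 1. address into MAR
        self.mdr = self.cells.get(self.mar, 0)   # 2-3. control logic fills MDR
        return self.mdr

    def write(self, address, data):
        self.mar, self.mdr = address, data       # 1-2. address into MAR, data into MDR
        self.cells[self.mar] = self.mdr          # 3-4. MDR written to memory

m = Memory()
m.write(100, 42)
# m.read(100) == 42
```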
40
Instruction types Computer instructions must be capable of performing 4 types of operations. Data transfer/movement between memory and processor registers. E.g., memory read, memory write Arithmetic and logic operations: E.g., addition, subtraction, comparison between two numbers. Program sequencing and flow of control: Branch instructions Input/output transfers to transfer data to and from the real world.
41
Instruction types (contd..)
Examples of different types of instructions, in assembly language notation.
Data transfers between processor and memory: Move A, B (B = A); Move A, R1 (R1 = A).
Arithmetic and logic operation: Add A, B, C (C = A + B).
Sequencing: Jump Label (Jump to the instruction that starts at Label).
Input/output data transfer: Input PORT, R5 (Read from I/O port "PORT" into register R5).
42
Specifying operands in instructions
Operands are the entities operated upon by the instructions. Recall that operands may have to be fetched from a memory location to execute an operation; memory locations have addresses by which they can be accessed. Operands may also be stored in the general purpose registers, e.g. an intermediate value of some computation which will be required immediately for subsequent computations. Registers also have addresses. Specifying the operands on which the instruction is to operate involves specifying the addresses of the operands; an address can be that of a memory location or of a register.
43
Source and destination operands
Operation may be specified as: Operation source1, source2, destination An operand is called a source operand if: It appears on the right-hand side of an expression E.g., Add A, B, C (C = A+ B) A and B are source operands. An operand is called a destination operand if: It appears on the left-hand side of an expression. E.g., Add A, B, C (C = A + B) C is a destination operand.
44
Source and destination operands (contd..)
In case of some instructions, the same operand serves as both the source and the destination. Same operand appears on the right and left side of an expression. E.g. Add A, B (B = A + B) B is both the source and the destination operand. Another classification of instructions is based on the number of operand addresses in the instruction.
45
Instruction types
Instructions can also be classified based on the number of operand addresses they include: 3, 2, 1, or 0 operand addresses. 3-address instructions are almost always instructions that implement binary operations, e.g. Add A, B, C (C = A + B). If k bits are used to specify the address of a memory location, then 3-address instructions need 3k bits to specify the operand addresses. 3-address instructions whose operand addresses are memory locations are too big to fit in one word.
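The operand-address arithmetic is easy to check (the function name is illustrative):

```python
# Bits needed for operand addresses when memory addresses are k bits wide.
def operand_bits(num_operands, k):
    return num_operands * k

three_addr = operand_bits(3, 16)   # 48 bits: cannot fit in a 32-bit word
two_addr = operand_bits(2, 16)     # 32 bits: fills a 32-bit word entirely
```

This is why 2-address forms with at least one register operand (whose address needs only a few bits) are attractive on machines with 32-bit words.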
46
Instruction types (contd..)
In 2-address instructions, one operand serves as both a source and the destination: e.g. Add A, B (B = A + B). 2-address instructions need 2k bits to specify the operand addresses; this may also be too big to fit into a word, so at least one operand is often a processor register: e.g. Add A, R1 (R1 = A + R1). 1-address instructions require only one operand: e.g. Clear A (A = 0). 0-address instructions do not operate on operands: e.g. Halt (halt the computer). How are the addresses of operands specified in instructions?
47
Addressing modes
The different ways in which the address of an operand is specified in an instruction are referred to as addressing modes.
Register mode: the operand is the contents of a processor register; the address of the register is given in the instruction. E.g. Clear R1.
Absolute mode: the operand is in a memory location; the address of the memory location is given explicitly in the instruction. E.g. Clear A. Also called "Direct mode" in some assembly languages.
Register and absolute modes can be used to represent variables.
48
Addressing modes (contd..)
Immediate mode: the operand is given explicitly in the instruction. E.g. Move #200, R0. Can be used to represent constants.
Register, absolute and immediate modes contain either the address of the operand or the operand itself. Some instructions instead provide information from which the memory address of the operand can be determined; that is, they provide the "effective address" of the operand, without giving the operand or its address explicitly. There are different ways in which the effective address of the operand can be generated.
49
Addressing modes (contd..)
Indirect mode: the effective address of the operand is the contents of a register or a memory location whose address appears in the instruction.
Add (R1), R0: register R1 contains address B, and address B holds the operand.
Add (A), R0: memory address A contains address B, and address B holds the operand.
R1 and A are called "pointers".
50
Addressing modes (contd..)
Indexing mode: the effective address of the operand is generated by adding a constant value (an offset) to the contents of a register. E.g. Add 20(R1), R0: register R1 contains 1000; the offset 20 is added to the contents of R1 to generate the address 1020, where the operand is located. The contents of R1 do not change in the process of generating the address. R1 is called an "index register". What address would be generated by Add 1000(R1), R0 if R1 held 20?
51
Addressing Modes (contd..)
Relative mode: the effective address of the operand is generated by adding a constant value to the contents of the Program Counter (PC). This is a variation of the indexing mode in which the index register is the PC instead of a general purpose register. When the instruction is being executed, the PC holds the address of the next instruction in the program. This mode is useful for specifying target addresses in branch instructions. Because the addressed location is "relative" to the PC, this is called "Relative mode".
52
Addressing Modes (contd..)
Autoincrement mode: Effective address of the operand is the contents of a register specified in the instruction. After accessing the operand, the contents of this register are automatically incremented to point to the next consecutive memory location. (R1)+ Autodecrement mode Before accessing the operand, the contents of this register are automatically decremented to point to the previous consecutive memory location. -(R1) Autoincrement and Autodecrement modes are useful for implementing “Last-In-First-Out” data structures.
53
Addressing modes (contd..)
Implicitly, the increment and decrement amounts are 1, which would allow us to access individual bytes in a byte addressable memory. Recall, however, that information is stored and retrieved one word at a time. In most computers, therefore, the increment and decrement amounts are equal to the word size in bytes. E.g., if the word size is 4 bytes (32 bits): autoincrement increments the contents by 4, and autodecrement decrements the contents by 4.
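Two of the effective-address calculations above can be sketched side by side. The register contents and memory values below are made-up illustration values, and the helper functions are hypothetical, not any real assembler's semantics.

```python
# Effective-address (EA) calculation for two addressing modes.
WORD = 4
regs = {"R1": 1000}
mem = {1000: 5, 1020: 7}

def indexed(offset, reg):
    """offset(Ri): EA = offset + [Ri]; Ri is left unchanged."""
    return offset + regs[reg]

def autoincrement(reg):
    """(Ri)+: EA = [Ri]; afterwards Ri is bumped by the word size."""
    ea = regs[reg]
    regs[reg] += WORD
    return ea

ea1 = indexed(20, "R1")        # EA = 20 + 1000 = 1020; operand is mem[1020]
ea2 = autoincrement("R1")      # EA = 1000; R1 then becomes 1004
```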
54
CSE 243: Introduction to Computer Architecture and Hardware/Software Interface
55
Instruction execution and sequencing
Recall the fetch/execute cycle of instruction execution. In order to complete a meaningful task, a number of instructions need to be executed. During the fetch/execution cycle of one instruction, the Program Counter (PC) is updated with the address of the next instruction: PC contains the address of the memory location from which the next instruction is to be fetched. When the instruction completes its fetch/execution cycle, the contents of the PC point to the next instruction. Thus, a sequence of instructions can be executed to complete a task.
56
Instruction execution and sequencing (contd..)
Simple processor model Processor has a number of general purpose registers. Word length is 32 bits (4 bytes). Memory is byte addressable. Each instruction is one word long. Instructions allow one memory operand per instruction. One register operand is allowed in addition to one memory operand. Simple task: Add two numbers stored in memory locations A and B. Store the result of the addition in memory location C. Move A, R0 (Move the contents of location A to register R0) Add B, R0 (Add the contents of location B to register R0) Move R0, C (Move the contents of register R0 to location C)
57
Instruction execution and sequencing (contd..)
Execution steps:
Step I: PC holds address 0. Fetches the instruction at address 0 (Move A, R0). Fetches operand A. Executes the instruction. Increments PC to 4.
Step II: PC holds address 4. Fetches the instruction at address 4 (Add B, R0). Fetches operand B. Executes the instruction. Increments PC to 8.
Step III: PC holds address 8. Fetches the instruction at address 8 (Move R0, C). Executes the instruction, storing the result in location C.
[Figure: the three instructions stored at word addresses 0, 4 and 8, with data locations A, B and C elsewhere in memory.]
Instructions are executed one at a time in order of increasing addresses: "straight line sequencing".
58
Instruction execution and sequencing (contd..)
Consider the following task: Add 10 numbers. Number of numbers to be added (in this case 10) is stored in location N. Numbers are located in the memory at NUM1, .... NUM10 Store the result in SUM. Move NUM1, R0 (Move the contents of location NUM1 to register R0) Add NUM2, R0 (Add the contents of location NUM2 to register R0) Add NUM3, R0 (Add the contents of location NUM3 to register R0) Add NUM4, R0 (Add the contents of location NUM4 to register R0) Add NUM5, R0 (Add the contents of location NUM5 to register R0) Add NUM6, R0 (Add the contents of location NUM6 to register R0) Add NUM7, R0 Add NUM8, R0 Add NUM9, R0 Add NUM10, R0 Move R0, SUM (Move the contents of register R0 to location SUM)
59
Instruction sequencing and execution (contd..)
A separate Add instruction to add each number in a list leads to a long list of Add instructions. The task can be accomplished in a compact way by using the Branch instruction:
Move N, R1 (Move the contents of location N, the number of numbers to be added, to register R1)
Clear R0 (This register holds the sum as the numbers are added)
LOOP: Determine the address of the next number. Add the next number to R0.
Decrement R1 (Counter which indicates how many numbers remain to be added)
Branch>0 LOOP (If all the numbers haven't been added, go to LOOP)
Move R0, SUM
60
Instruction execution and sequencing (contd..)
Decrement R1: R1 initially holds the number of numbers to be added (Move N, R1); the count is decremented each time a new number is added (Decrement R1), so R1 keeps a count of the numbers remaining to be added. Branch>0 LOOP: checks whether the count in register R1 has reached 0. If it has, the sum in register R0 is stored at memory location SUM (Move R0, SUM); if not, the next number is added and the loop repeats. The "go to" is specified implicitly. Note that the instruction Branch>0 LOOP has no explicit reference to register R1.
61
Instructions execution and sequencing (contd..)
The processor keeps track of information about the results of the previous operation. This information is recorded in individual bits called "condition code flags". Common flags are:
N (negative): set to 1 if the result is negative, else cleared to 0.
Z (zero): set to 1 if the result is zero, else cleared to 0.
V (overflow): set to 1 if arithmetic overflow occurs, else cleared.
C (carry): set to 1 if a carry-out results, else cleared.
The flags are grouped together in a special purpose register called the "condition code register" or "status register". If the result of Decrement R1 is 0, then the Z flag is set. Branch>0 tests the flags: if the result was 0, the sum is stored; else the next number is added.
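Flag setting can be sketched for the N and Z flags used by the loop above (the function is a hypothetical helper; V and C need the operand widths and are omitted here):

```python
# N and Z condition code flags derived from an operation's result.
def set_flags(result):
    return {"N": 1 if result < 0 else 0,
            "Z": 1 if result == 0 else 0}

# Decrementing R1 from 1 to 0 sets Z, so Branch>0 falls through.
assert set_flags(0) == {"N": 0, "Z": 1}
assert set_flags(3) == {"N": 0, "Z": 0}
assert set_flags(-1) == {"N": 1, "Z": 0}
```

A "greater than zero" condition corresponds to both N and Z being clear.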
62
Instruction execution and sequencing (contd..)
Branch instructions alter the sequence of program execution by loading a new value into the PC (recall that the PC holds the address of the next instruction to be executed). The processor then fetches and executes the instruction at this new address, instead of the instruction that follows the branch. The new address is called the "branch target". Conditional branch instructions cause a branch only if a specified condition is satisfied; otherwise the PC is incremented in the normal way and the next sequential instruction is fetched and executed. Conditional branch instructions use the condition code flags to check whether the various conditions are satisfied.
63
Instruction sequencing and execution (contd..)
How do we determine the address of the next number? Recall the addressing modes: initialize register R2 with the address of the first number using immediate addressing, use indirect addressing to add each number, and increment register R2 by 4 so that it points to the next number.
Move N, R1
Move #NUM1, R2 (Initialize R2 with the address of NUM1)
Clear R0
LOOP: Add (R2), R0 (Indirect addressing)
Add #4, R2 (Increment R2 to point to the next number)
Decrement R1
Branch>0 LOOP
Move R0, SUM
64
Instruction execution and sequencing (contd..)
Note that the same can be accomplished using autoincrement mode:
Move N, R1
Move #NUM1, R2 (Initialize R2 with the address of NUM1)
Clear R0
LOOP: Add (R2)+, R0 (Autoincrement)
Decrement R1
Branch>0 LOOP
Move R0, SUM
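The loop's behavior can be traced with a Python analogue. The memory addresses and values below are invented for illustration; `r2` plays the role of the autoincremented pointer register.

```python
# Python analogue of the summation loop: add N numbers via a moving pointer.
WORD = 4
mem = {0: 10, 4: 20, 8: 30}     # NUM1..NUM3 at consecutive word addresses
N = 3

r2 = 0                          # Move #NUM1, R2 (address of the first number)
r0 = 0                          # Clear R0
r1 = N                          # Move N, R1
while r1 > 0:                   # Branch>0 LOOP
    r0 += mem[r2]               # Add (R2)+, R0: fetch operand ...
    r2 += WORD                  # ... then autoincrement the pointer
    r1 -= 1                     # Decrement R1
# r0 == 60, and r2 now points one word past the last number
```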
65
Stacks
A stack is a list of data elements, usually words or bytes, with the accessing restriction that elements can be added or removed only at one end of the stack. The end at which elements are added and removed is called the "top" of the stack; the other end is called the "bottom". A stack is also known as a pushdown stack or a last-in, first-out (LIFO) stack. Push: placing a new item on the stack. Pop: removing the top item from the stack.
66
Stacks (contd..)
Data stored in the memory of a computer can be organized as a stack, with successive elements occupying successive memory locations. When new elements are pushed onto the stack, they are placed in successively lower address locations; the stack grows in the direction of decreasing memory addresses. A processor register called the "Stack Pointer (SP)" keeps track of the address of the element that is at the top at any given time. A general purpose register could serve as the stack pointer.
67
Stacks (contd..)
Consider a processor with 65,536 bytes of byte-addressable memory and a word length of 4 bytes. The first element of the stack is at BOTTOM (the high end of memory, near address 65535); SP points to the element at the top.
[Figure: stack in memory growing downward from BOTTOM toward lower addresses, with SP pointing at the topmost element (example values 43 and 739 on the stack).]
A push operation can be implemented as:
Subtract #4, SP
Move A, (SP)
A pop operation can be implemented as:
Move (SP), B
Add #4, SP
Push with autodecrement: Move A, -(SP)
Pop with autoincrement: Move (SP)+, A
68
Subroutines In a program, subtasks that are repeated on different data values are usually implemented as subroutines. When a program requires the use of a subroutine, it branches to the subroutine; branching to the subroutine is called "calling" the subroutine. The instruction that performs this branch operation is Call. After a subroutine completes execution, the calling program continues with the instruction immediately after the one that called the subroutine; the subroutine is said to "return" to the program, and the instruction that performs this is called Return. A subroutine may be called from many places in the program. How does the subroutine know which address to return to?
69
Subroutines (contd..) Recall that when an instruction is being executed, the PC holds the address of the next instruction to be executed. This is the address to which the subroutine must return, so this address must be saved by the Call instruction. The way in which a processor makes it possible to call and return from a subroutine is called the "subroutine linkage method". The return address could be stored in a register called the "Link register". Call instruction: stores the contents of the PC in the link register, then branches to the subroutine. Return instruction: branches to the address contained in the link register.
70
Subroutines (contd..) Calling program calls a subroutine,
whose first instruction is at address 1000. The Call instruction is at address 200. While the Call instruction is being executed, the PC points to the next instruction, at address 204. The Call instruction stores address 204 in the Link register and loads 1000 into the PC. The Return instruction loads the address 204 back from the Link register into the PC.
71
Subroutines and stack For nested subroutines:
After the last subroutine in the nested list completes execution, the return address is needed to execute the return instruction. This return address is the last one that was generated in the nested call sequence. Return addresses are generated and used in a “Last-In-First-Out” order. Push the return addresses onto a stack as they are generated by subroutine calls. Pop the return addresses from the stack as they are needed to execute return instructions.
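The LIFO handling of return addresses can be sketched directly with a list acting as the stack. The addresses 204 and 308 below are hypothetical, chosen only to show the ordering.

```python
return_stack = []  # the processor stack, holding return addresses

def call(return_address):
    # Each Call pushes the address of the instruction after it.
    return_stack.append(return_address)

def ret():
    # Each Return pops the most recently pushed address.
    return return_stack.pop()

call(204)  # main program calls SUB1; return address 204
call(308)  # SUB1 calls SUB2 (nested); return address 308
assert ret() == 308  # SUB2 returns first: last in, first out
assert ret() == 204  # then SUB1 returns to the main program
```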
72
Assembly language Recall that information is stored in a computer in binary form, in patterns of 0s and 1s. Such patterns are awkward when preparing programs, so symbolic names are used to represent them. So far we have used normal words such as Move, Add, Branch to represent the corresponding binary patterns. When we write programs for a specific computer, the normal words are replaced by acronyms called mnemonics, e.g., MOV, ADD, INC. A complete set of symbolic names and rules for their use constitutes a programming language, referred to as the assembly language.
73
Assembly language (contd..)
Programs written in assembly language need to be translated into a form understandable by the computer, namely, binary, or machine language form. Translation from assembly language to machine language is performed by an assembler. Original program in assembly language is called source program. Assembled machine language program is called object program. Each mnemonic represents the binary pattern, or OP code for the operation performed by the instruction. Assembly language must also have a way to indicate the addressing mode being used for operand addresses. Sometimes the addressing mode is indicated in the OP code mnemonic. E.g., ADDI may be a mnemonic to indicate an addition operation with an immediate operand.
74
Assembly language (contd..)
Assembly language allows the programmer to specify other information necessary to translate the source program into an object program: how to assign numerical values to the names, where to place instructions in the memory, and where to place data operands in the memory. The statements which provide this additional information to the assembler are called assembler directives or commands.
75
Assembly language (contd..)
100 Move N,R1
104 Move #NUM1,R2
108 Clear R0
112 LOOP Add (R2),R0
116 Add #4,R2
120 Decrement R1
124 Branch>0 LOOP
128 Move R0,SUM
200 SUM
204 N (contains 100)
208 NUM1
212 NUM2
...
604 NUM100
Questions the assembler must answer: What is the numeric value assigned to SUM? What are the addresses of the data NUM1 through NUM100? What is the address of the memory location represented by the label LOOP? How is a data value placed into a memory location?
76
Assembly language (contd..)
EQU: value of SUM is 200. ORIGIN: place the data block at 204. DATAWORD: place the value 100 at 204 and assign it the label N. RESERVE: a memory block of 400 bytes is to be reserved for data; NUM1 is associated with address 208. Instructions of the object program are to be loaded in memory starting at 100. RETURN: terminate program execution. END: end of the program source text.
Memory address label / Operation / Addressing or data information:
Assembler directives:
SUM    EQU      200
       ORIGIN   204
N      DATAWORD 100
NUM1   RESERVE  400
       ORIGIN   100
Statements that generate machine instructions:
START  MOVE     N,R1
       MOVE     #NUM1,R2
       CLR      R0
LOOP   ADD      (R2),R0
       ADD      #4,R2
       DEC      R1
       BGTZ     LOOP
       MOVE     R0,SUM
       RETURN
Assembler directive:
       END      START
77
Assembly language (contd..)
Assembly language instructions have a generic form: Label Operation Operand(s) Comment. The four fields are separated by a delimiter, typically one or more blank characters. The Label is optionally associated with a memory address: it may indicate the address of an instruction to be executed, or the address of a data item. How does the assembler determine the values that represent names? The value of a name may be specified by an EQU directive, e.g., SUM EQU 100. A name may also be defined in the Label field of another instruction; then the value represented by the name is determined by the location of that instruction in the object program. E.g., in BGTZ LOOP, the value of LOOP is the address of the instruction ADD (R2),R0.
78
Assembly language (contd..)
The assembler scans through the source program and keeps track of all the names and their corresponding numerical values in a "symbol table". When a name appears a second time, it is replaced with its value from the table. What if a name appears before it is given a value, for example a branch to an address that hasn't been seen yet (a forward branch)? The assembler can scan through the source code twice: a first pass to build the symbol table, and a second pass to substitute names with numerical values. This is a two-pass assembler.
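A toy two-pass scheme can be sketched in Python. This is a deliberately simplified illustration, not a real assembler: every line is assumed to be one 4-byte instruction, labels end with a colon, and substitution is plain text replacement. It shows why the forward reference in the first line needs the second pass.

```python
def assemble(lines, origin=100, word=4):
    # Pass 1: walk the source once, recording each label's address.
    symbols, addr = {}, origin
    for line in lines:
        label, colon, _ = line.partition(":")
        if colon:
            symbols[label.strip()] = addr
        addr += word  # hypothetical fixed instruction size
    # Pass 2: substitute every name with its value from the symbol table.
    resolved = []
    for line in lines:
        _, colon, rest = line.partition(":")
        text = (rest if colon else line).strip()
        for name, value in symbols.items():
            text = text.replace(name, str(value))
        resolved.append(text)
    return symbols, resolved

syms, out = assemble(["BRANCH LOOP", "LOOP: ADD (R2),R0", "DEC R1"])
# The forward branch in line 1 is resolved to the address of LOOP (104).
```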
79
Encoding of machine instructions
Instructions specify the operation to be performed and the operands to be used. Which operation is to be performed and the addressing mode of the operands may be specified using an encoded binary pattern referred to as the “OP code” for the instruction. Consider a processor with: Word length 32 bits. 16 general purpose registers, requiring 4 bits to specify the register. 8 bits are set aside to specify the OP code. 256 instructions can be specified in 8 bits.
80
Encoding of machine instructions (contd..)
One-word instruction format:
OP code (8 bits) | Source (7 bits) | Dest (7 bits) | Other info (10 bits)
Opcode: 8 bits. Source operand: 4 bits to specify a register and 3 bits to specify the addressing mode. Destination operand: 4 bits to specify a register and 3 bits to specify the addressing mode. Other information: 10 bits to specify other information such as an index value.
81
Encoding of machine instructions (contd..)
What if the source operand is a memory location specified using the absolute addressing mode?
OP code (8 bits) | Source (3 bits) | Dest (7 bits) | Address (14 bits)
Opcode: 8 bits. Source operand: 3 bits to specify the addressing mode. Destination operand: 4 bits to specify a register and 3 bits to specify the addressing mode. That leaves 14 bits to specify the address of the memory location, which is insufficient to give a complete 32-bit address in the instruction. Solution: include a second word as part of this instruction, leading to a two-word instruction.
82
Encoding of machine instructions (contd..)
Two-word instruction format:
Word 1: OP code | Source | Dest | Other info
Word 2: Memory address/Immediate operand
The second word specifies the address of a memory location; it may also be used to specify an immediate operand. Complex instructions can be implemented using multiple words. Complex Instruction Set Computer (CISC) refers to processors using instruction sets of this type.
83
Encoding of machine instructions (contd..)
Insist that all instructions must fit into a single 32-bit word: then an instruction cannot specify a memory location or an immediate operand. ADD R1, R2 can be specified, but ADD LOC, R2 cannot. Instead, use the indirect addressing mode: ADD (R3), R2, where R3 serves as a pointer to memory location LOC. How is the address of LOC loaded into R3? Using the relative addressing mode.
84
Encoding of machine instructions (contd..)
The restriction that an instruction must occupy only one word has led to a style of computers known as Reduced Instruction Set Computers (RISC). Manipulation of data must be performed on operands already in processor registers. This restriction may require additional instructions for some tasks. However, it is possible to fit a three-operand instruction into a single 32-bit word, where all three operands are in registers:
OP code | Ri | Rj | Rk | Other info
Three-operand instruction.
85
CSE243: Introduction to Computer Architecture and Hardware/Software Interface
86
68000 Instruction Set Architecture
Registers and addressing Addressing modes Instructions Assembly language Branch instructions Stacks and subroutines Simple programs in assembly language
87
Registers and addressing
External word length: Word length used to transfer data between the processor and memory. 16 bits. Internal word length: Refers to the size of the processor registers 32 bits.
88
Register structure 8 data registers (D0-D7) and 8 address registers (A0-A7): Each is 32 bits long. Instructions use operands of three different lengths: Long word – 32 bits Word – 16 bits Byte – 8 bits A byte or a word operand stored in a data register is in the low order bit positions of a register: Most instructions operating on byte or word operands do not affect the higher order bit positions. Some instructions extend the sign of a shorter operand into the high-order bits.
89
Register structure (contd..)
Address registers hold information used in determining the addresses of memory operands. Address and data registers can also be used as index registers. Address register A7 has the special function of serving as the processor stack pointer. Address registers and address calculations involve 32 bits, but only the least significant 24 bits of an address are used externally to access the memory. Processor Status Register (SR): five condition code bits, three interrupt bits (their function will be clear when we discuss I/O), and two mode-select bits.
90
Register structure (contd..)
Register layout (figure summary): data registers D0-D7, each 32 bits, accessible as long word, word, or byte; address registers A0-A6; A7 exists in two copies, the user stack pointer and the supervisor stack pointer; the PC; and the 16-bit status register SR. SR bits: T (trace mode select), S (supervisor mode select), I (interrupt mask), and the condition code flags C (carry), V (overflow), Z (zero), N (negative), X (extend).
91
Addressing Byte addressable memory
Byte-addressable memory, organized in 16-bit words; two consecutive 16-bit words constitute one 32-bit long word. A word address must be an even number, that is, words must be aligned on an even boundary. The byte in the high-order position has the same address as the word; the byte in the low-order position has the next higher address. This is the big-endian address assignment. (Figure: bytes 0 and 1 at word address 0, bytes 2 and 3 at address 2, ..., up to bytes 2^24-2 and 2^24-1 at the last word address.)
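The big-endian byte placement can be sketched in Python: the high-order byte of a long word goes to the lowest address, which is the address of the word itself. The dictionary-as-memory and the sample value 0x12345678 are illustrative only.

```python
def store_long_big_endian(memory, addr, value):
    # Big-endian: most significant byte first, at the lowest address.
    for i in range(4):
        memory[addr + i] = (value >> (8 * (3 - i))) & 0xFF

memory = {}
store_long_big_endian(memory, 0, 0x12345678)
# memory[0] holds the high-order byte 0x12, memory[3] the low-order 0x78
```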
92
Addressing (contd..) 68000 generates 24-bit memory addresses.
Addressable space is 2^24 (16,777,216 or 16M) bytes. The addressable space may be thought of as consisting of 512 (2^9) pages of 32K (2^15) bytes each. Page 0 is hexadecimal addresses 0 to 7FFF, page 1 is hexadecimal addresses 8000 to FFFF, and the last page is hexadecimal addresses FF8000 to FFFFFF.
93
Addressing modes Instruction size: Immediate mode
Many instructions are one word (16 bits) long; some require additional 16-bit words for extra addressing information. The first word of an instruction is the OP-code word; subsequent words (if any) are called extension words. Note that the 68000 does not place a one-word limit on the instruction size: it is a CISC architecture. Immediate mode: the operand is included in the instruction. Byte, word, and long-word operands are specified in an extension word; some very small operands are included in the OP-code word.
94
Addressing modes (contd..)
Absolute mode: the absolute address of an operand is given in the extension word(s). Long mode: a 24-bit address is specified. Short mode: a 16-bit value is specified, which serves as the low-order 16 bits of the address; the sign bit of the 16-bit value is extended to provide the high-order 8 bits. Since the sign bit is 0 or 1, two pages can be addressed: page 0 and the last page (addresses FF8000 and up). Register mode: the operand is in a processor register specified in the instruction. Register indirect mode: the effective address (EA) of the operand is in a processor register specified in the instruction.
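The short-mode sign extension can be sketched as a small function: a 16-bit value becomes a 24-bit address, so only the first and last 32K pages are reachable.

```python
def short_absolute(addr16):
    # Sign-extend a 16-bit short absolute address to the 68000's 24 bits.
    if addr16 & 0x8000:          # sign bit is 1
        return 0xFF0000 | addr16 # high-order 8 bits become 1s: last page
    return addr16                # high-order 8 bits are 0s: page 0

assert short_absolute(0x7FFF) == 0x007FFF  # top of page 0
assert short_absolute(0x8000) == 0xFF8000  # start of the last page
```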
95
Addressing modes (contd..)
Autoincrement mode: EA is in one of the address registers which is specified in the instruction. After the operand is addressed, the contents of the address register are incremented by 1 (byte), 2 (word), 4 (long-word). Autodecrement mode: Same as above, except the contents are decremented before the operand is accessed. Basic index mode: 16-bit signed offset and an address register are specified. Full index mode: 8-bit signed offset, an address register and an index register (either a data or address register) are specified. Basic relative and Full relative modes: Same as Basic index and Full index modes respectively, except PC is used instead of address register.
96
Addressing modes (contd..)
Consider the instruction ADD 100(PC,A1), D0: full relative mode for the source operand, register mode for the destination operand. Compute the EA of the source operand: suppose A1 holds the value 6 and the OP-code word is at 1000. After the OP-code word is fetched, the PC points to the extension word, i.e., it contains 1002. EA = [PC] + [A1] + 100 = 1002 + 6 + 100 = 1108. This could be used to address data in an array form. In the relative modes, the PC need not be specified explicitly in the instruction. (Figure: OP-code word at 1000, extension word at 1002 holding the offset 100, operand at 1108.)
97
Instructions Extensive set of instructions.
Both two-operand and one-operand instructions. Two-operand instructions: Format: OP src, dst. Assembly language notation, the actual encoding may not be in the same order. OP specifies the operation performed. Result is placed in the destination location. For most two operand instructions, either src, or dst operand must be in a data register. Other operand may be specified using any one of the addressing modes. The only two operand instruction where any addressing mode may be used to specify both the operands is the MOVE instruction.
98
Instructions (contd..) Consider the instruction: ADD #9, D3
Add 9 to the contents of D3 and store the result in D3. Bits 15-12, together with bit 8, specify the operation. Bits 11-9 specify the destination register, D3. Bits 7-6 specify the size of the operands; the default is word, with the bit pattern 01. Bits 5-0 identify the source operand as immediate, with the pattern 111100. The actual value of the source operand is in the extension word. The binary encoding of the OP-code word is 1101 0110 0111 1100, i.e., D67C in hex.
99
Instructions (contd..) Consider the instruction: ADD #9, D3
Add 9 to the contents of D3 and store the result in D3. The OP-code word (D67C) is at location i; the operand 9 is in the extension word at location i+2. After the OP-code word is fetched, the PC points to i+2; after the extension word is fetched, the PC points to i+4. Contents of D3 before execution are 25, and after execution are 34.
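The field extraction for the OP-code word D67C can be sketched with shifts and masks. The field positions follow the slide's description (bits 15-12 operation, 11-9 destination register, 7-6 size, 5-0 source mode); this is a teaching decomposition, not a full 68000 decoder.

```python
def decode_add_immediate(word):
    # Pull the fields of the OP-code word apart, as described above.
    return {
        "op_hi":   (word >> 12) & 0xF,  # bits 15-12: 1101 for ADD
        "dst_reg": (word >> 9) & 0x7,   # bits 11-9: destination data register
        "size":    (word >> 6) & 0x3,   # bits 7-6: 01 = word operands
        "src":     word & 0x3F,         # bits 5-0: 111100 = immediate mode
    }

fields = decode_add_immediate(0xD67C)
# fields: op_hi = 0b1101, dst_reg = 3 (D3), size = 0b01, src = 0b111100
```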
100
Assembly language Instructions can operate on operands of three sizes, size of the operands must be indicated: Size of the operand is indicated in the operation mnemonic, using L for long-word, W for word, and B for byte. ADD.L indicates ADD operation with operand size long-word. Representation of numbers: Default representation is in the decimal form. Prefix $ is used for hexadecimal, and % for binary. Representation of Alphanumeric characters: Placed within quotes, replaced by the assembler with their ASCII codes. Several characters may be placed in a string between quotes.
101
Assembler directives ORG: Starting address of an instruction or data block. EQU: Equating names with numerical values DC (Define Constant): Data constants. Size indicator may be appended to specify the size of the data constants. DC.B, DC.W, DC.L. DS (Define Storage): Reserve a block of memory. Names can be associated with memory locations using labels as described earlier.
102
Condition code flags Five condition code flags, stored in the status register: C (carry), V (overflow), Z (zero), N (negative), and X (extend). The X flag is set the same way as the C flag, but is not affected by many instructions; it is useful for multiple-precision arithmetic. Since operands can be of three sizes (byte, word and long-word), the C and X flags depend on the carry-out from bit positions 7, 15 and 31 respectively.
103
Branch instructions Recall that a conditional branch instruction causes program execution to continue with the instruction at the branch target address if the condition is met; otherwise the instruction immediately following the branch is executed. The branch target address is computed by adding an offset to the PC. Branch instructions come with two sizes of offset, 8 and 16 bits. Branch instructions with an 8-bit offset: the offset is included in the OP-code word, so the branch target address is within -128 to +127 bytes of the value in the PC. Since the PC contents are incremented at each fetch, the offset defines the distance from the word that follows the branch instruction's OP-code word.
104
Branch instructions (contd..)
Branch instructions with a 16-bit offset: the offset is specified in an extension word following the OP-code word, so the branch target address can be in the range of +32K to -32K from the value in the PC. The offset is the distance from the extension word to the branch target. There are 16 conditional branch instructions, each available with both offset sizes, and 1 unconditional branch instruction.
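The 16-bit branch target computation can be sketched as follows: interpret the offset as a signed 2's complement value and add it to the address of the extension word, as described above. The addresses are illustrative.

```python
def branch_target(extension_word_addr, offset16):
    # Interpret the 16-bit offset as signed 2's complement.
    if offset16 & 0x8000:
        offset16 -= 0x10000
    # The offset is measured from the extension word.
    return extension_word_addr + offset16

assert branch_target(1002, 98) == 1100      # forward branch
assert branch_target(1002, 0xFFFC) == 998   # offset -4: backward branch
```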
105
Stacks and subroutines
Stack can be implemented using any of the address registers, with the help of autoincrement and autodecrement addressing modes. Register A7 is designated as the processor stack pointer, and it points to the processor stack. Stack used by the processor for all operations that it performs automatically, such as subroutine linkage. Two different stack pointer registers, for two modes of operation: Supervisor mode: All machine instructions can be executed. User mode: Certain privileged instructions cannot be executed. Application programs: user mode. System software: supervisor mode.
106
Stacks and subroutines (contd..)
Stack may be used for: Subroutine linkage mechanism. Passing parameters to a subroutine and returning result from the subroutine. Branch-to-Subroutine (BSR) is used to call a subroutine: Causes the contents of the PC to be pushed onto the stack. Return from Subroutine (RTS) is used to return from a subroutine: Causes the return address at the top of the stack to be popped into the PC.
107
Simple programs in 68000 assembly language.
Add the contents of memory locations A and B, and place the result in C. The size of operands A and B is word. Recall that for a two-operand instruction, one of the operands has to be in a data register D0 through D7.
MOVE A, D0    Move A to register D0.
ADD B, D0     Add the contents of location B to D0 and store the result in D0.
MOVE D0, C    Transfer the result in register D0 to location C.
108
Simple programs in 68000 assembly language.
A = 639 (decimal), B = -215 (decimal). Add the contents of A and B, and place the result in location C. The size of the operands at locations A and B is 16 bits. A, B, C and the program to add them are stored in memory:
201200  OP-code word     MOVE A,D0
201202  0020
201204  1150             (address of A = 201150)
201206  OP-code word     ADD B,D0
201208  0020
20120A  1152             (address of B = 201152)
20120C  OP-code word     MOVE D0,C
20120E  0020
201210  2200             (address of C = 202200)
After execution, [202200] = 424.
109
Simple programs in 68000 assembly language
Program to add A and B, and store the result in C, with assembler directives (memory address label / operation / addressing or data information):
Assembler directives:
C      EQU    $202200
       ORG    $201150
A      DC.W   639
B      DC.W   -215
       ORG    $201200
Statements that generate machine instructions:
       MOVE   A,D0
       ADD    B,D0
       MOVE   D0,C
Assembler directive:
       END
110
Simple programs in 68000 assembly language.
Add N numbers: the first number is stored at starting address NUM1, the count of entries to be added is stored at address N, the result goes to location SUM, and the size of each number to be added is word.
      MOVE.L   N,D1        N contains n, the number of entries to be added; D1 is used as a counter that determines how many times to execute the loop.
      MOVEA.L  #NUM1,A2    A2 is used as a pointer to the list entries; it is initialized to NUM1, the address of the first entry.
      CLR.L    D0          D0 is used to accumulate the sum.
LOOP  ADD.W    (A2)+,D0    Successive numbers are added in D0.
      SUBQ.L   #1,D1       Decrement the counter.
      BGT      LOOP        If [D1] > 0, execute the loop again.
      MOVE.L   D0,SUM      Store the sum in SUM.
111
CSE 243: Introduction to Computer Architecture and Hardware/Software Interface
112
ARM Instruction Set Architecture
Register structure Memory access Addressing modes Instructions Assembly language Subroutines Simple programs in ARM assembly language
113
Register structure Sixteen 32-bit registers labeled R0 through R15:
15 general purpose registers (R0 through R14) Program Counter (PC) register R15. General purpose registers can hold data operands or memory addresses. Current program status register (CPSR) or Status Register: Condition code flags (N, Z, C, V), Interrupt disable flags. Processor mode bits. 15 additional general purpose registers called banked registers. Duplicates of some of the R0 through R14. Used when the processor switches into the Supervisor mode or Interrupt modes of operation.
114
Register structure (contd..)
Register layout (figure summary): general purpose registers R0-R14, each 32 bits; R15 is the program counter. The 32-bit CPSR (status register) holds the condition code flags in bits 31-28: N (negative), Z (zero), C (carry), V (overflow); its low-order bits hold the interrupt disable bits and the processor mode bits.
115
Memory access Memory is byte-addressable, using 32-bit addresses.
Two operand lengths are used in moving data between the memory and processor registers: Bytes (8-bits) and Word (32-bits). Word addresses must be aligned: Multiples of 4. Little-endian and big-endian addressing schemes are supported. Determined by an external input control line. When the length of the operand in a data transfer operation is a byte, the byte is stored in the low-order byte position of the register.
116
Addressing modes Memory is addressed by generating the Effective Address (EA) of the operand by adding a signed offset to the contents of a base register Rn. Pre-indexed mode: EA is the sum of the contents of the base register Rn and an offset value. Pre-indexed with writeback: EA is generated the same way as pre-indexed mode. EA is written back into Rn. Post-indexed mode: EA is the contents of Rn. Offset is then added to this address and the result is written back to Rn.
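The three index modes can be summarized as tiny functions, each returning the effective address and the final contents of the base register Rn. This is a sketch of the addressing arithmetic only; the values 1000 and 200 are illustrative.

```python
def pre_indexed(rn, offset):
    # EA = Rn + offset; Rn is unchanged.
    return rn + offset, rn

def pre_indexed_writeback(rn, offset):
    # EA = Rn + offset; EA is written back into Rn.
    return rn + offset, rn + offset

def post_indexed(rn, offset):
    # EA = Rn; Rn + offset is then written back into Rn.
    return rn, rn + offset

assert pre_indexed(1000, 200)           == (1200, 1000)
assert pre_indexed_writeback(1000, 200) == (1200, 1200)
assert post_indexed(1000, 200)          == (1000, 1200)
```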
117
Addressing modes (contd..)
Relative addressing mode: the Program Counter (PC) is used as the base register; this is the pre-indexed addressing mode with an immediate offset. No absolute addressing mode is available in the ARM processor. The offset is specified either as an immediate value in the instruction itself, or as the contents of a register specified in the instruction.
118
Instructions ARM architecture is a RISC architecture
Each instruction is encoded into a 32-bit word. Instruction format:
bits 31-28: Condition | 27-20: OP code | 19-16: Rn | 15-12: Rd | 11-4: Other info | 3-0: Rm
An instruction specifies a conditional execution code, an OP code, two or three registers (Rn, Rd, and Rm), and other information. If Rm is not needed, the other-information field extends to the last bit.
119
Instructions (contd..) All instructions are conditionally executed, depending on a condition specified in the condition code of the instruction. Instruction is executed only if the current state of the processor condition code flags satisfies the condition specified in the high-order 4 bits of the instruction. One of the condition codes is used to indicate that an instruction is always executed.
120
Instructions (contd..) Memory access instructions
Memory is accessed using LOAD and STORE instructions. Mnemonic for LOAD is LDR and STORE is STR. If a byte operand is desired, then mnemonics are LDRB and STRB Recall that the memory is accessed by generating the effective address (EA) of the operand using various addressing modes. Pre-indexed addressing mode (1): - Offset specified as an immediate value - LDR Rd, [Rn,#offset] Pre-indexed addressing mode(2): - Offset magnitude is specified in a register. - LDR Rd, [Rn,+Rm] - Contents of Rm specify the magnitude of the offset. - Rm is preceded by a minus sign if negative offset is desired.
121
Instructions (contd..) Memory access instructions
Pre-indexed addressing mode with the offset magnitude in a register: LDR R3, [R5, R6]. R5 (base register) contains 1000 and R6 (offset register) contains 200, so EA = 1000 + 200 = 1200; the operand is at address 1200.
122
Instructions (contd..) Memory access instructions
Pre-indexed with writeback (1): - Offset is specified as an immediate value. - LDR Rd, [Rn,#offset]! - The exclamation mark indicates writeback, that is, the effective address is written back into Rn. Pre-indexed with writeback (2): - Offset magnitude is specified in a register. - LDR Rd, [Rn, +Rm]! Pre-indexing with writeback is a generalization of the Autodecrement addressing mode.
123
Instructions (contd..) Memory access instructions
Pre-indexed mode with writeback, offset magnitude in a register: LDR R3, [R5, R6]!. R5 (base register) contains 1000 and R6 (offset register) contains 200, so EA = 1000 + 200 = 1200; the operand is at 1200, and R5 is updated to 1200.
124
Instructions (contd..) Memory access instructions Post-indexed (1):
- Offset is specified as an immediate value. - LDR Rd, [Rn],#offset - Offset is added to Rn after the operand is accessed and the contents are stored in Rn. Post-indexed (2): - Offset magnitude is specified in a register. - LDR Rd,[Rn]+Rm Post-indexed addressing mode always involves writeback. It is a generalization of Autoincrement addressing mode.
125
Instructions (contd..) Memory access instructions
Post-indexed addressing, offset magnitude in a register: LDR R3, [R5], R6. R5 (base register) contains 1000 and R6 (offset register) contains 200. EA = 1000, so the operand at address 1000 is accessed; R5 is then updated to 1000 + 200 = 1200.
126
Instructions (contd..) Memory access instructions
When the offset is given in a register, it may be scaled by a power of 2 by shifting left or right before being added to the base register. This applies in all the addressing modes: pre-indexed, pre-indexed with writeback and post-indexed. Example: LDR R0, [R1, -R2, LSL #4]!
Relative mode: - Only the address of a memory location is specified. - LDR R1, ITEM. - This would normally be the Absolute addressing mode, but since there is no absolute addressing mode, the assembler computes the offset of the memory location ITEM from the updated PC and stores it as an immediate value in the machine instruction; the EA is then formed by adding that offset to the PC, which is not specified explicitly in the instruction. - Note that the PC always points past the location where the current instruction is stored, so the offset must be computed from the updated PC, and the operand must be within ±4095 bytes relative to it.
127
Instructions (contd..) Memory access instructions Relative mode
Example of the relative addressing mode: the instruction LDR R1, ITEM is at word address 1000, and the operand ITEM is at address 1060. The updated PC would be expected to hold 1004, but because of pipelined execution it holds 1008: the PC points to the word after the instruction location. The assembler therefore computes the offset as 1060 - 1008 = 52, and EA = [updated PC] + 52 = 1060.
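The offset computed by the assembler for the example above can be sketched in one line: the PC seen by the instruction is two words (8 bytes) past the instruction's own address.

```python
def relative_offset(instr_addr, item_addr, word=4):
    # Pipelined execution: the updated PC is two words past the instruction.
    updated_pc = instr_addr + 2 * word
    return item_addr - updated_pc

assert relative_offset(1000, 1060) == 52  # the example above
```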
128
Instructions (contd..) Memory access instructions
Block transfer instructions: instructions for loading and storing multiple operands. Any subset of the general purpose registers can be loaded or stored; the memory operands must be available in successive word locations. Mnemonic for Load Multiple is LDM, for Store Multiple STM. All forms of pre- and post-indexing, with and without writeback, can be used; they operate on a base register Rn specified in the instruction, and only word-size operands are allowed. Useful in implementing subroutines, when multiple registers need to be stored on the stack. Example: LDM R10, {R0,R1,R6,R7}, where R10 is the base register and contains 1000, transfers the contents of locations 1000, 1004, 1008 and 1012 to registers R0, R1, R6 and R7 respectively.
129
Instructions (contd..) Register move instructions
Copy the contents of register R0 to register R1: MOV R1, R0. Load an immediate operand into the low-order 8 bits of register R1: MOV R1, #76.
130
Instructions (contd..) Arithmetic instructions
Arithmetic instructions operate on operands given in the general-purpose registers or on immediate operands. Memory operands are not allowed for these instructions (Typical of RISC architectures). OPcode Rd, Rn, Rm - Operation is performed using the operands in registers Rn, Rm. - Result is stored in register Rd. OPcode Rd, Rn, #Operand. - Second operand may also be given in an immediate mode. OPcode Rd, Rn, Rm, LSL #2 - When the second operand is specified in a register, it may also be shifted left or right.
131
Instructions (contd..) Operand shift instructions
Shifting and rotation operations are performed as separate instructions in most other processors. In the case of the ARM, shifting and rotation operations can be incorporated into most instructions, which saves code space and may improve execution time performance.
132
Instructions (contd..) Conditional branch instructions
Contain a signed 2’s complement offset that is added to the updated contents of the PC to generate the branch target address. The condition to be tested to determine whether or not branching should take place is specified in the high-order 4 bits (31-28) of the instruction word; the OP code occupies bits 27-24 and bits 23-0 hold the offset.
[Figure: BEQ LOCATION at address 1000, followed by the word at 1004; updated [PC] = 1008; offset = 92; the branch target instruction is at LOCATION = 1100.]
Note that in general the PC would have pointed to 1004, but here it points to 1008 for reasons of pipelined execution.
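The signed offset encoding can be sketched as below (a simplification: the offset is stored as a plain byte offset in a hypothetical 24-bit field, ignoring the word-alignment scaling a real ARM applies):

```python
def encode_branch_offset(instr_addr, target, bits=24):
    """Signed 2's-complement offset added to the updated PC (instr + 8),
    encoded in `bits` bits."""
    offset = target - (instr_addr + 8)
    return offset & ((1 << bits) - 1)   # 2's-complement encoding

def decode_branch_target(instr_addr, enc, bits=24):
    """Sign-extend the encoded field and add it to the updated PC."""
    offset = enc - (1 << bits) if enc >= (1 << (bits - 1)) else enc
    return instr_addr + 8 + offset

# BEQ at 1000 targeting LOCATION = 1100: offset = 92
enc = encode_branch_offset(1000, 1100)
print(enc, decode_branch_target(1000, enc))  # 92 1100
```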
133
Instructions (contd..) Instructions to set condition codes
Conditional branch instructions check the condition code flags in the status register. Condition code flags may be set by arithmetic and logic operations if explicitly specified to do so by a bit in the OP-code. Some instructions are provided for the sole purpose of setting condition code flags.
134
Assembly language AREA indicates the beginning of a block of memory
Uses the argument CODE or DATA. AREA CODE indicates the beginning of a code block. AREA DATA indicates the beginning of a data block. ENTRY directive indicates that the program is to begin execution at the following instruction. DCD directive is used to label and initialize a data operand. EQU directive is used to equate symbolic names to constants. RN directive is used to assign a symbolic name to a register indicative of its usage in the program.
135
Subroutines Branch and link (BL) instruction is used to call a subroutine. Operates in the same way as other branch instructions. In addition, stores the return address of the next instruction following the BL instruction into register R14. R14 acts as a link register. For nested subroutines, the contents of the link register may be stored on the stack by the subroutine. Register R13 acts as the stack pointer. Parameters can be passed through registers or on the stack.
136
Simple programs in ARM assembly language
Add N numbers:
- The first number is stored at the starting address NUM1.
- The count of numbers to be added is stored at address N.
- Store the result at location SUM.
- Size of each number to be added is a word.

        LDR   R1,N          Load count into R1.
        LDR   R2,POINTER    Load address NUM1 into R2.
        MOV   R0,#0         Clear accumulator R0.
LOOP    LDR   R3,[R2],#4    Load current number into R3.
        ADD   R0,R0,R3      Add number into R0.
        SUBS  R1,R1,#1      Decrement loop counter.
        BGT   LOOP          Branch back if not done.
        STR   R0,SUM        Store sum.
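The loop above can be mirrored in Python to check the arithmetic (memory modeled as a dict from byte addresses to words; the post-indexed LDR R3,[R2],#4 both loads a word and advances the pointer by 4; all names are illustrative):

```python
def add_n_numbers(memory, num1_addr, n):
    """Python rendering of the ARM loop above."""
    r2 = num1_addr          # POINTER
    r1 = n                  # count
    r0 = 0                  # accumulator
    while r1 > 0:           # SUBS / BGT pair
        r3 = memory[r2]     # LDR R3,[R2],#4
        r2 += 4
        r0 += r3            # ADD R0,R0,R3
        r1 -= 1
    return r0               # STR R0,SUM

mem = {2000: 3, 2004: 17, 2008: 27, 2012: 12, 2016: 322}
print(add_n_numbers(mem, 2000, 5))  # 381
```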
137
Simple programs in ARM assembly language
Memory address label | Operation | Addressing or data information

Assembler directives (beginning of the code block):
        AREA    CODE
        ENTRY              Begin execution from next instruction.
Statements that generate machine instructions:
        LDR     R1,N
        LDR     R2,POINTER
        MOV     R0,#0
LOOP    LDR     R3,[R2],#4
        ADD     R0,R0,R3
        SUBS    R1,R1,#1
        BGT     LOOP
        STR     R0,SUM
Assembler directives (beginning of the data block; label & initialize data operands):
        AREA    DATA
SUM     DCD     0
N       DCD     5
POINTER DCD     NUM1
NUM1    DCD     3,-17,27,-12,322
        END
138
CSE 243: Introduction to Computer Architecture and Hardware/Software Interface
139
Control path Recall that the control path is the physical entity in a processor which: fetches instructions, fetches operands, decodes instructions, and schedules events in the data path which actually cause the instruction to be executed. This is the fetch/execute cycle, which is repeated indefinitely.
140
Fetch/execute cycle Step I:
Fetch the contents of the memory location pointed to by the Program Counter (PC). The PC points to the memory location that holds the instruction to be executed. Load the contents of that memory location into the Instruction Register (IR). Step II: Increment the contents of the PC by 4 (assuming the memory is byte addressable and the word length is 32 bits). Step III: Carry out the operation specified by the instruction in the IR. Steps I and II constitute the fetch phase, and are repeated as many times as necessary to fetch the complete instruction. Step III constitutes the execution phase.
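The three steps can be sketched as a minimal fetch/execute loop (all names, the tuple instruction format, and the handler scheme are illustrative, not the actual machine encoding):

```python
regs = {}

def run(memory, handlers, pc=0, steps=10):
    """Step I: fetch into IR. Step II: increment PC by 4 (byte-addressable,
    32-bit words). Step III: execute via a handler, which may change PC."""
    for _ in range(steps):
        ir = memory[pc]            # Step I: fetch into IR
        pc += 4                    # Step II: increment PC
        op = ir[0]
        if op == "HALT":
            break
        pc = handlers[op](ir, pc)  # Step III: execute
    return pc

def nop(ir, pc):
    return pc

def setr(ir, pc):                  # ("SET", reg, value)
    regs[ir[1]] = ir[2]
    return pc

mem = {0: ("SET", "R0", 7), 4: ("NOP",), 8: ("HALT",)}
final_pc = run(mem, {"SET": setr, "NOP": nop})
print(final_pc, regs)  # 12 {'R0': 7}
```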
141
Internal organization of a processor
Recall that a processor has several registers/building blocks: Memory address register (MAR) Memory data register (MDR) Program Counter (PC) Instruction Register (IR) General purpose registers R0 - R(n-1) Arithmetic and logic unit (ALU) Control unit. How are these units organized and how do they communicate with each other?
142
Internal organization of a processor
[Figure: single-bus internal organization of the processor. PC, MAR, MDR, IR, registers R0 through R(n-1), TEMP, Y, and Z are all connected to the internal processor bus. MAR drives the address lines and MDR connects to the data lines of the memory bus. The instruction decoder and control logic issue the control signals. A MUX (Select) chooses between register Y and the constant 4 for the ALU's A input; the bus supplies the B input; ALU control lines (Add, Sub, XOR, ..., carry-in) select the function, and the ALU output is latched in Z.]
143
Single bus organization
ALU, control unit and all the registers are connected via a single common bus. Bus is internal to the processor and should not be confused with the external bus that connects the processor to the memory and I/O devices. Data lines of the external memory bus are connected to the internal processor bus via MDR. Register MDR has two inputs and two outputs. Data may be loaded to (from) MDR from (to) internal processor bus or external memory bus. Address lines of the external memory bus are connected to the internal processor bus via MAR. MAR receives input from the internal processor bus. MAR provides output to external memory bus.
144
Single bus organization (contd..)
Instruction decoder and control logic block, or control unit issues signals to control the operation of all units inside the processor and for interacting with the memory bus. Control signals depend on the instruction loaded in the Instruction Register (IR) Outputs from the control logic block are connected to: Control lines of the memory bus. ALU, to determine which operation is to be performed. Select input of the multiplexer MUX to select between Register Y and constant 4. Control lines of the registers, to select the registers.
145
Single bus organization (contd..)
Registers Y, Z, and TEMP: Used by the processor for temporary storage during execution of some instructions. Note that Registers R0 to R(n-1) are used to store data generated by one instruction for later use by another instruction. Data is stored in R0 through R(n-1) after the execution of an instruction. Multiplexer MUX selects either the output of register Y or a constant 4, depending upon the control input Select. Constant 4 is used to increment the value of the PC.
146
Registers and the bus
[Figure: an m-bit register attached to bus lines 0 through m-1 (e.g., 31); each register bit (bit 0 through bit m-1) is driven by the clock.]
147
Registers and the bus (contd..)
A bus may be viewed as a collection of parallel wires. Buses have no memory: They are just a collection of wires. When data is on the bus, all registers can “see” that data at their inputs. A register may place its contents onto the bus.
148
Registers and the bus (contd..)
At any one time, only one register may output its contents to the bus: Which register outputs its content to the bus is determined by the control signal issued by the control logic. Control signal depends on the instruction loaded in the instruction register. Registers can load data from the bus: Which registers load data from the bus is determined by the control signal issued by the control logic. Registers are clocked (sequential) entities (unlike ALU which is purely combinatorial).
149
Each register Ri has two control signals, Riin and Riout.
If Riin = 1, the data on the bus is loaded into the register. If Riout = 1, the contents of the register are placed onto the bus. The same holds for registers Y and Z as well.
[Figure: register Ri with control signals Riin and Riout; register Y with Yin; the MUX (Select) chooses between Y and the constant 4 for ALU input A; the bus drives input B; register Z (Zin, Zout) latches the ALU output.]
150
Registers and the bus (contd..)
[Figure: one register bit. A two-input multiplexer (controlled by Riin) feeds an edge-triggered D flip-flop (clocked); the Q output drives the bus through a tri-state gate enabled by Riout.]
Each bit in a register may be implemented by an edge-triggered D flip-flop. A two-input multiplexer is used to select the data applied to the input of the flip-flop. The Q output of the flip-flop is connected to the bus via a tri-state gate.
151
Registers and the bus (contd..)
Riin = 1: Multiplexer selects the data on the bus. Data is loaded into the flip-flop at the rising edge of the clock. Riin = 0: Multiplexer feeds back the value currently stored in the flip-flop. Q output represents the value currently stored in the flip-flop.
152
Registers and the bus (contd..)
Riout = 1: The tri-state gate places the value of the flip-flop onto the bus, and drives the bus for as long as Riout is asserted. Riout = 0: The gate's output is in the high-impedance (electrically disconnected) state. Corresponds to the open-circuit state.
153
Registers and the bus (contd..)
Operation of a tri-state gate: a tri-state gate can enter one of three output states.
- Its output can be in a logic low state (L).
- Its output can be in a logic high state (H).
- Its output can be effectively an open circuit (high impedance).
When a tri-state gate connected to a bus is in the high-impedance state, its output is effectively disconnected from the bus.
Riout = 1, output is: logic low if Q = 0; logic high if Q = 1.
Riout = 0: high impedance (open-circuit condition).
154
Registers and the bus (contd..)
Operation of an edge-triggered flip-flop: data is loaded from the register to the bus (or to the register from the bus) at the rising edge of the clock, i.e., at the low-to-high transition within a single processor clock period.
155
Registers and the bus (contd..)
Data transfers and operations take place within time periods defined by the processor clock. Time period is known as the clock cycle. At the beginning of the clock cycle, the control signals that govern a particular transfer are asserted. For e.g., if the data are to be transferred from register R0 to the bus, then R0out is set to 1. Edge-triggered flip-flop operation explained earlier used only the rising edge of the clock for data transfer. Other schemes are possible, for example, data transfers may use rising and falling edges of the clock. When edge-triggered flip-flops are not used, two or more clock signals may be needed to guarantee proper transfer of data. This is known as multiphase clocking.
156
Simple register transfer example
Transfer the contents of register R3 to register R4.
1. Control signals R3out and R4in become 1. They stay valid until the end of the clock cycle.
2. After a small delay, the contents of R3 are placed onto the bus. The contents of R3 stay on the bus until the end of the clock cycle.
3. At the end of the clock cycle, the data on the bus is loaded into R4. R3out and R4in become 0.
157
Loading multiple registers from the bus
Transfer the contents of register R3 to registers R4 and R5.
1. Control signals R3out, R4in and R5in become 1. They stay valid until the end of the clock cycle.
2. After a small delay, the contents of R3 are placed onto the bus. The contents of R3 stay on the bus until the end of the clock cycle.
3. At the end of the clock cycle, the data on the bus is loaded into R4 and R5. R3out, R4in and R5in become 0.
158
Loading multiple registers from the bus (contd..)
It is possible to load multiple registers simultaneously from the bus. For e.g., transfer the contents of register R3 to registers R4 and R7 simultaneously. The number of registers that can be simultaneously loaded depends on: Drive capability (fan-out) Noise. Note that this is an electrical issue, not a logical issue. Distinguish this from multiple registers loading the bus: For e.g. load the contents of registers R3 and R4 onto the bus simultaneously. Logically inconsistent event. Physically dangerous event.
159
CSE 243: Introduction to Computer Architecture and Hardware/Software Interface
Recap the last class.
160
Arithmetic Logic Unit (ALU)
ALU is a purely combinatorial device: It has no memory or internal storage. It has 2 input vectors: These may be called the A- and B-vector or the R- and S-vector The inputs are as wide as the registers/system bus (e.g., 16, 32 bits) It has 1 output vector Usually denoted F
161
Arithmetic Logic Unit (ALU) (contd..)
Sample functions performed by the ALU F = A+B F = A+B+1 F = A-B F = A-B-1 F = A and B F = A or B F = not A F = not B F = not A + 1 F = not B + 1 F = (not A) and B F = A and (not B) F = A xor B F = not (A xor B) F = A F = B
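The sample functions can be modeled as a table of purely combinational functions on a 32-bit datapath (a sketch covering a subset of the list; names are illustrative):

```python
MASK = 0xFFFFFFFF  # 32-bit datapath

# Each ALU function maps current inputs (A, B) to output F; no state.
ALU = {
    "ADD":   lambda a, b: (a + b) & MASK,
    "ADD1":  lambda a, b: (a + b + 1) & MASK,
    "SUB":   lambda a, b: (a - b) & MASK,
    "AND":   lambda a, b: a & b,
    "OR":    lambda a, b: a | b,
    "NOTA":  lambda a, b: ~a & MASK,
    "NEGA":  lambda a, b: (~a + 1) & MASK,   # not A + 1 (2's complement)
    "XOR":   lambda a, b: a ^ b,
    "PASSA": lambda a, b: a,
}

def alu(f, a, b):
    """Purely combinational: output F depends only on the current inputs."""
    return ALU[f](a, b)

print(alu("ADD", 3, 4))  # 7
```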
162
Arithmetic Logic Unit (ALU) (contd..)
The ALU is basically a black box: purely combinatorial logic (AND/OR/NOT/NAND/NOR etc.) inside, with no registers.
[Figure: ALU block with input vectors A and B, output F, a carry-in, and control lines selecting the function (Add, Sub, XOR, ...).]
163
Arithmetic and Logic Unit (ALU) (contd..)
ALU connections to the bus: the ALU must have only one input connection from the bus. The other input must be stored in a holding register called the Y register. A multiplexer selects between register Y and the constant 4, depending on the Select line. One operand of a two-operand instruction must be placed into the Y register before the other operand is placed onto the bus.
[Figure: processor bus feeding register Y and ALU input B; MUX (Select) chooses Y or the constant 4 for input A; control lines and carry-in drive the ALU.]
164
Arithmetic and Logic Unit (ALU) (contd..)
ALU connections to the bus: identical reasoning tells us that there must be an output register Z which collects the output of the ALU at the end of each cycle. This way, there can be:
- one operand in the Y register,
- one operand on the bus,
- the result stored in the Z register.
[Figure: as before, with register Z latching the ALU output.]
165
Performing an arithmetic operation
Add the contents of registers R1 and R2 and place the result in R3. That is: R3 = R1 + R2 1. Place the contents of register R1 into the Y register in the first clock cycle. 2. Place the contents of register R2 onto the bus in the second clock cycle. Both inputs to the ALU are now valid. Select register Y, and assert the ALU command F=A+B. 3. In the third clock cycle, Z register has latched the output of the ALU. Thus the contents of the Z register can be copied into register R3.
166
Performing an arithmetic operation (contd..)
Clock cycle 1: R1out, Yin.
[Figure: the single-bus datapath (instruction decoder and control logic, MAR, MDR, PC, IR, Y, MUX with constant 4, ALU, Z, registers R1 to R(n-1)) with these control signals highlighted; R1 is copied into Y.]
167
Performing an arithmetic operation (contd..)
Clock cycle 2: R2out, SelectY, Add, Zin.
[Figure: the same datapath with these control signals highlighted; the ALU computes R1 + R2 and Z latches the result.]
168
Performing an arithmetic operation (contd..)
Clock cycle 3: Zout, R3in. Clock cycle 4: R3 has the sum.
[Figure: the same datapath with Zout and R3in highlighted; the sum is copied from Z into R3.]
169
Performing an arithmetic operation (contd..)
Clock Cycle 1: R1out, Yin (Y=R1) Clock Cycle 2: R2out, SelectY, Add, Zin (Z = R1+R2) Clock Cycle 3: Zout, R3in (R3=Z)
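The three-cycle sequence can be sketched as a toy model of the single-bus datapath (register and signal names follow the text; the class and method names are illustrative):

```python
class SingleBusDatapath:
    """Toy model: one bus driver per cycle, a Y holding register,
    and a Z register latching the ALU output."""
    def __init__(self, regs):
        self.r = dict(regs)   # general-purpose registers
        self.y = self.z = 0

    def cycle(self, out, *, yin=False, add=False, zin=False, rin=None):
        # Exactly one source drives the bus: a named register, or Z.
        bus = self.r[out] if out in self.r else self.z
        if yin:
            self.y = bus
        if add and zin:
            self.z = self.y + bus      # ALU: F = A + B, latched into Z
        if rin:
            self.r[rin] = bus

dp = SingleBusDatapath({"R1": 10, "R2": 32, "R3": 0})
dp.cycle("R1", yin=True)               # cycle 1: R1out, Yin
dp.cycle("R2", add=True, zin=True)     # cycle 2: R2out, SelectY, Add, Zin
dp.cycle("Z", rin="R3")                # cycle 3: Zout, R3in
print(dp.r["R3"])  # 42
```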
170
Performing an arithmetic operation (contd..)
Inputs of the ALU: Input B is tied to the bus. Input A is tied to the output of the multiplexer. Output of the ALU: Tied to the input of the Z register. Z register: Input tied to the output of the ALU. Output tied to the bus. Unlike Riin, Zin loads data from the output of the ALU and not the bus.
171
Performing an arithmetic operation (contd..)
Events as seen by the registers, bus, and ALU over time:

cycle        controls active        what the bus sees   what Y has   output of ALU
start of 1   R1out, Yin             contents of R1      unknown      unknown
end of 1     R1out, Yin             contents of R1      R1           unknown
2            R2out, F=A+B           contents of R2      R1           F = A+B = R1+R2 (not yet valid)
end of 2     R2out, F=A+B, Zin      contents of R2      R1           F = A+B = R1+R2 (now valid)
3            Zout, R3in             contents of Z       R1           unknown
end of 3     Zout, R3in             contents of Z       R1           unknown (but R3 latches bus contents)
172
ALU operations RC = RA op RB Clock cycle 1: Move RA to Y register.
RAout, Yin Clock cycle 2: Put RB on the bus, perform F = RA op RB, and transfer the result to Z. RBout, (RA op RB)=1, SelectY, Zin Clock cycle 3: Put Z on the bus, and load Z into RC. Zout, RCin
173
Fetching a word from memory
The processor has to specify the address of the memory location where the information is stored and request a Read operation. The processor transfers the required address to MAR, whose output is connected to the address lines of the memory bus. The processor uses the control lines of the memory bus to indicate that a Read operation is needed. The requested information is received from the memory, stored in MDR, and then transferred from MDR to other registers.
174
Fetching a word from memory (contd..)
Connections for register MDR:
[Figure: MDR sits between the memory-bus data lines and the internal processor bus, with four control signals.]
MDRoutE and MDRinE control the connection to the external (memory) bus. MDRout and MDRin control the connection to the internal bus.
175
Fetching a word from memory (contd..)
Timing of the internal processor operations must be coordinated with the response time of memory Read operations. Processor completes one internal data transfer in one clock cycle. Memory response time for a Read operation is variable and usually longer than one clock cycle. Processor waits until it receives an indication that the requested Read has been completed. Control signal (MFC) is used for this purpose. MFC is set to 1 by the memory to indicate that the contents of the specified location have been read and are available on the data lines of the memory bus.
176
Fetching a word from memory (contd..)
MOVE (R1), R2
1. Load the contents of Register R1 into MAR.
2. Start a Read operation on the memory bus.
3. Wait for the MFC response from the memory.
4. Load MDR from the memory bus.
5. Load the contents of MDR into Register R2.
Steps can be performed separately, or some may be combined. Steps 1 and 2 can be combined: load R1 into MAR and activate the Read control signal simultaneously. Steps 3 and 4 can be combined: activate control signal MDRinE while waiting for the response from the memory. The last step loads the contents of MDR into Register R2. The memory Read operation thus takes 3 steps.
177
Fetching a word from memory (contd..)
MOVE (R1), R2: Memory read operation takes 3 steps.
Step 1:
- Place R1 onto the internal processor bus.
- Load the contents of the bus into MAR.
- Activate the Read control signal.
- R1out, MARin, Read
Step 2:
- Wait for MFC from the memory.
- Activate the control signal to load data from the external bus into MDR.
- MDRinE, WMFC
Step 3:
- Place the contents of MDR onto the internal processor bus.
- Load the contents of the bus into Register R2.
- MDRout, R2in
178
Storing a word into memory
MOVE R2, (R1): Memory write operation takes 3 steps.
Step 1:
- Place R1 onto the internal processor bus.
- Load the contents of the bus into MAR.
- R1out, MARin
Step 2:
- Place R2 onto the internal processor bus.
- Load the contents of the internal processor bus into MDR.
- Activate the Write operation.
- R2out, MDRin, Write
Step 3:
- Place the contents of MDR onto the external memory bus.
- Wait for the memory write operation to complete.
- MDRoutE, WMFC
179
Execution of a complete instruction
Add the contents of a memory location pointed to by Register R3 to register R1. ADD (R3), R1 To execute the instruction we must perform the following tasks: 1. Fetch the instruction. 2. Fetch the operand (contents of the memory location pointed to by R3). 3. Perform the addition. 4. Load the result into R1.
180
Execution of a complete instruction
Task 1: Fetch the instruction Recall that: - PC holds the address of the memory location which has the next instruction to be executed. - IR holds the instruction currently being executed. Step 1 - Load the contents of PC to MAR. - Activate the Read control signal. - Increment the contents of the PC by 4. - PCout, MARin, Read, Select4, Add, Zin. Step 2 - Update the contents of the PC. - Copy the updated PC to Register Y (useful for Branch instructions). - Wait for MFC from memory. - Zout, PCin, Yin, WMFC Step 3 - Place the contents of MDR onto the bus. - Load the IR with the contents of the bus. - MDRout, IRin
181
Execution of a complete instruction (contd..)
Task 2: Fetch the operand (contents of memory pointed to by R3). Task 3: Perform the addition. Task 4: Load the result into R1.
Step 4:
- Place the contents of Register R3 onto the internal processor bus.
- Load the contents of the bus into MAR.
- Activate the Read control signal.
- R3out, MARin, Read
Step 5:
- Place the contents of R1 onto the bus.
- Load the contents of the bus into Register Y (recall: one operand in Y).
- Wait for MFC.
- R1out, Yin, WMFC
Step 6:
- Place the contents of MDR onto the internal processor bus.
- Select Y, and perform the addition.
- Place the result in Z.
- MDRout, SelectY, Add, Zin
Step 7:
- Place the contents of Register Z onto the internal processor bus.
- Load the contents of the bus into Register R1.
- Zout, R1in
182
Execution of a complete instruction (contd..)
Step   Action
1      PCout, MARin, Read, Select4, Add, Zin
2      Zout, PCin, Yin, WMFC
3      MDRout, IRin
4      R3out, MARin, Read
5      R1out, Yin, WMFC
6      MDRout, SelectY, Add, Zin
7      Zout, R1in, End
183
Branch instructions Recall that the updated contents of the PC are copied into Register Y in Step 2. This is not necessary for the ADD instruction, but is useful in BRANCH instructions: the branch target address is computed by adding the updated contents of the PC to an offset, so copying the updated contents of the PC to Register Y speeds up the execution of BRANCH instructions. Since the fetch cycle is the same for all instructions, this step is performed for all instructions. Since Register Y is not used for any other purpose at that time, it does not have any impact on the execution of the instruction.
184
CSE 243: Introduction to Computer Architecture and Hardware/Software Interface
185
Multiple-bus organization
Simple single-bus structure: Results in long control sequences, because only one data item can be transferred over the bus in a clock cycle. Most commercial processors provide multiple internal paths to enable several transfers to take place in parallel. Multiple-bus organization.
186
Multiple bus organization (contd..)
[Figure: three-bus organization. Buses A and B carry source operands from the register file to the ALU's A and B inputs (a MUX can substitute the constant 4 for input A); bus C returns the ALU result R to the register file. An incrementer updates the PC. The instruction decoder, IR, MDR (memory data lines), and MAR (memory address lines) are also attached to the buses.]
187
Multiple bus organization (contd..)
Three-bus organization to connect the registers and the ALU of a processor. All general-purpose registers are combined into a single block called the register file. The register file has three ports. Two output ports are connected to buses A and B, allowing the contents of two different registers to be accessed simultaneously and placed on buses A and B. A third, input port allows the data on bus C to be loaded into a third register during the same clock cycle. Inputs to the ALU and outputs from the ALU: buses A and B are used to transfer the source operands to the A and B inputs of the ALU. The result is transferred to the destination over bus C.
188
Multiple bus organization (contd..)
ALU can also pass one of its two input operands unmodified if needed: Control signals for such an operation are R=A or R=B. Three bus arrangement obviates the need for Registers Y and Z in the single bus organization. Incrementer unit: Used to increment the PC by 4. Source for the constant 4 at the ALU multiplexer can be used to increment other addresses such as the memory addresses in multiple load/store instructions.
189
Multiple bus organization (contd..)
Three-operand instruction: ADD R4, R5, R6

Step   Action
1      PCout, R=B, MARin, Read, IncPC
2      WMFC
3      MDRoutB, R=B, IRin
4      R4outA, R5outB, SelectA, Add, R6in, End

1. Pass the contents of the PC through the ALU and load them into MAR. Increment the PC.
2. Wait for MFC.
3. Transfer the data received into MDR to the IR.
4. Execution of the instruction is the last step.
190
Control unit To execute instructions the processor must generate the necessary control signals in proper sequence. Hardwired control: Control unit is designed as a finite state machine. Inflexible but fast. Appropriate for simpler machines (e.g. RISC machines) Microprogrammed control: Control path is designed hierarchically using principles identical to the CPU design. Flexible, but slow. Appropriate for complex machines (e.g. CISC machines)
191
Hardwired control

Step   Action
1      PCout, MARin, Read, Select4, Add, Zin
2      Zout, PCin, Yin, WMFC
3      MDRout, IRin
4      R3out, MARin, Read
5      R1out, Yin, WMFC
6      MDRout, SelectY, Add, Zin
7      Zout, R1in, End

Each step in this sequence is completed in one clock cycle. A counter may be used to keep track of the control steps. Each state, or count, of this counter corresponds to one control step.
192
Hardwired control (contd..)
Required control signals are determined by the following information: Contents of the control step counter. Determines which step in the sequence. Contents of the instruction register. Determines the actual instruction Contents of the condition code flags. Used for example in a BRANCH instruction. External input signals such as MFC.
193
Hardwired control (contd..)
Control unit organization: the control unit consists of a decoder/encoder block that accepts the following inputs:
- Control step counter (driven by the clock, CLK).
- Instruction Register (IR).
- Condition codes.
- External inputs.
It generates the control signals.
[Figure: clock → control step counter → decoder/encoder; the IR, condition codes and external inputs also feed the decoder/encoder, which outputs the control signals.]
194
Hardwired control (contd..)
[Figure: detailed control unit. The clock drives the control step counter; a step decoder produces a separate signal line T1 ... Tn for each step, and an instruction decoder produces a separate signal line INS1 ... INSm for each instruction in the IR. The encoder combines these with the condition codes and external inputs to generate the control signals, including Run, End, and the counter Reset.]
195
Hardwired control (contd..)
Control signals such as Zin, PCout, and Add are generated by the encoder block. Suppose Zin is asserted:
- During T1 for all instructions.
- During T6 for the ADD instruction.
- During T4 for the unconditional BRANCH instruction.
Then the encoder implements the logic function Zin = T1 + T6 · ADD + T4 · BRANCH.
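That sum-of-products can be checked directly (step values and signal names follow the text; the function name is illustrative):

```python
def z_in(step, instruction):
    """Zin = T1 + T6*ADD + T4*BRANCH, from the assertion times above.
    `step` is the control-step counter value; signals are active-high."""
    T1 = step == 1
    T4 = step == 4
    T6 = step == 6
    return T1 or (T6 and instruction == "ADD") or (T4 and instruction == "BRANCH")

print(z_in(1, "ADD"), z_in(6, "ADD"), z_in(4, "BRANCH"), z_in(4, "ADD"))
# True True True False
```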
196
Hardwired control (contd..)
Control hardware can be viewed as a state machine: it changes state every clock cycle depending on the contents of the instruction register, the condition codes, and the external inputs. The outputs of the state machine are the control signals: the sequence of control signals generated by the machine is determined by the wiring of logic elements, hence the name "hardwired control". Speed of operation is one of the advantages of hardwired control. Disadvantages include: little flexibility, and limited complexity of the instruction set it can implement.
197
Microprogrammed control
Hardwired control generates control signals using: a control step counter and a decoder/encoder circuit. Microprogrammed control: control signals are generated by a program similar to machine language programs.
198
Microprogrammed control (contd..)
Control Word (CW): a word whose individual bits represent the various control signals. For the example sequence, the control signals are: PCout, PCin, MARin, Read, MDRout, IRin, Yin, SelectY, Select4, Add, Zin, Zout, R1out, R1in, R3out, WMFC, End. Recall the seven-step control sequence for the ADD instruction shown earlier; at every step, some control signals are asserted (= 1) and all others are 0.
199
Microprogrammed control (contd..)
Microinstruction control words for the sequence (SelectY is represented by Select = 0, and Select4 by Select = 1); a 1 means the signal is asserted in that step:

Step 1: PCout, MARin, Read, Select = 1, Add, Zin
Step 2: Zout, PCin, Yin, WMFC
Step 3: MDRout, IRin
Step 4: R3out, MARin, Read
Step 5: R1out, Yin, WMFC
Step 6: MDRout, Select = 0, Add, Zin
Step 7: Zout, R1in, End

At every step, a Control Word needs to be generated. Every instruction needs a sequence of CWs for its execution. The sequence of CWs for an instruction is the microroutine for the instruction. Each CW in this microroutine is referred to as a microinstruction.
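The CW table above can be sketched as sets of asserted signals, with the one-bit-per-signal control word produced on demand (a sketch; signal names follow the text):

```python
# Each control word (CW) is the set of signals asserted in that step;
# the microroutine is the ordered sequence of CWs.
MICROROUTINE = [
    {"PCout", "MARin", "Read", "Select4", "Add", "Zin"},
    {"Zout", "PCin", "Yin", "WMFC"},
    {"MDRout", "IRin"},
    {"R3out", "MARin", "Read"},
    {"R1out", "Yin", "WMFC"},
    {"MDRout", "SelectY", "Add", "Zin"},
    {"Zout", "R1in", "End"},
]

def control_bits(cw, all_signals):
    """One bit per control signal: 1 if asserted in this CW, else 0."""
    return [1 if s in cw else 0 for s in all_signals]

signals = ["PCout", "PCin", "MARin", "Read", "MDRout", "IRin", "Zin", "Zout"]
print(control_bits(MICROROUTINE[0], signals))  # [1, 0, 1, 1, 0, 0, 1, 0]
```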
200
Microprogrammed control (contd..)
Every instruction has its own microroutine, which is made up of microinstructions. The microroutines for all instructions in the instruction set of a computer are stored in a special memory called the Control Store. Recall that the Control Unit generates the control signals by sequentially reading the CWs of the corresponding microroutine from the control store.
201
Microprogrammed control (contd..)
Basic organization of a microprogrammed control unit: the microprogram counter (mPC) is used to read CWs from the control store sequentially. When a new instruction is loaded into the IR, the starting address generator generates the starting address of the microroutine, and this address is loaded into the mPC. The mPC is automatically incremented by the clock, so successive microinstructions are read from the control store.
[Figure: IR → starting address generator → mPC (clocked) → control store → CW.]
202
Microprogrammed control (contd..)
The basic organization of the microprogrammed control unit cannot check the status of condition codes or external inputs to determine what the next microinstruction should be. Recall that in hardwired control, this was handled by an appropriate logic function. How to handle this in microprogrammed control: use conditional branch microinstructions. These microinstructions, in addition to the branch address, also specify which of the external inputs, condition codes, or possibly registers should be checked as a condition for branching.
203
Microprogrammed control (contd..)
Address   Microinstruction
0         PCout, MARin, Read, Select4, Add, Zin
1         Zout, PCin, Yin, WMFC
2         MDRout, IRin
3         Branch to starting address of appropriate microroutine
...
25        If N = 0, then branch to microinstruction 0
26        Offset-field-of-IRout, SelectY, Add, Zin
27        Zout, PCin, End

Fetch a BRANCH<0 instruction; its microroutine is at address 25, so microinstruction 3 branches to address 25. Microinstruction 25 tests the N bit of the condition codes: if N = 0, go to address 0 and fetch a new instruction; otherwise execute the microinstruction at 26, which puts the branch target address into Register Z, followed by the microinstruction at location 27. Address 25 is the output of the starting address generator and is loaded into the microprogram counter (mPC).
204
Microprogrammed control (contd..)
Control unit with conditional branching in the microprogram: a starting and branch address generator replaces the starting address generator.
[Figure: external inputs, condition codes and the IR feed the starting and branch address generator → mPC (clocked) → control store → CW.]
205
Microprogrammed control (contd..)
The starting and branch address generator accepts as inputs: the contents of the Instruction Register (as before), the external inputs, and the condition codes. It generates a new address and loads it into the microprogram counter (mPC) when a microinstruction instructs it to do so. The mPC is incremented every time a microinstruction is fetched, except:
- When a new instruction is loaded into the IR, the mPC is loaded with the starting address of the microroutine for that instruction.
- When a branch microinstruction is encountered and the branch condition is satisfied, the mPC is loaded with the branch address.
- When an End microinstruction is encountered, the mPC is loaded with the address of the first CW in the microroutine for the instruction fetch cycle.
206
CSE 243: Introduction to Computer Architecture and Hardware/Software Interface
207
Microprogrammed control (contd..)
Microinstruction format Simple approach is to allocate one bit for each control signal - Results in long microinstructions, since the number of control signals is usually very large. - Few bits are set to 1 in any microinstruction, resulting in a poor use of bit space. Reduce the length of the microinstruction by taking advantage of the fact that most signals are not needed simultaneously, and many signals are mutually exclusive. For example: - Only one ALU function is active at a time. - Source for a data transfer must be unique. - Read and Write memory signals cannot be active simultaneously.
208
Microprogrammed control (contd..)
Microinstruction format Group mutually exclusive signals in the same group. At most one microoperation can be specified per group. Use a binary coding scheme to represent the signals within a group. Examples: if the ALU has 16 operations, then 4 bits are sufficient. Group the register output signals into the same group, since only one of these signals can be active at any given time (why? only one register may drive the bus). If the CPU has 4 general-purpose registers, then PCout, MDRout, Zout, Offsetout, R0out, R1out, R2out, R3out and TEMPout can be placed in a single group, and 4 bits will be needed to represent these.
209
Microprogrammed control (contd..)
Microinstruction fields F1 through F8 (one group per field):
F1 (4 bits): 0000 No transfer, 0001 PCout, 0010 MDRout, 0011 Zout, 0100 R0out, 0101 R1out, 0110 R2out, 0111 R3out, 1010 TEMPout, 1011 Offsetout
F2 (3 bits): 000 No transfer, 001 PCin, 010 IRin, 011 Zin, 100 R0in, 101 R1in, 110 R2in, 111 R3in
F3 (3 bits): 000 No transfer, 001 MARin, 010 MDRin, 011 TEMPin, 100 Yin
F4 (4 bits): 0000 Add, 0001 Sub, ..., 1111 XOR (16 ALU functions)
F5 (2 bits): 00 No action, 01 Read, 10 Write
F6 (1 bit): 0 SelectY, 1 Select4
F7 (1 bit): 0 No action, 1 WMFC
F8 (1 bit): 0 Continue, 1 End
Each group occupies a field large enough to represent all the signals in the group. Most fields must include one inactive code, which specifies no action, but not all fields have to include an inactive code.
210
Microprogrammed control (contd..)
Microinstruction format Grouping control signals into fields requires a little more hardware: - Decoding circuits are needed to decode patterns to individual control signals. Cost of the additional hardware is offset by reduced number of bits in each microinstruction: - Reduces the size of the control store.
211
Microprogrammed control (contd..)
Microinstruction format Alternative approach to reduce the length of the microinstruction: Enumerate the patterns of the required signals in all microinstructions. Assign each meaningful combination of active control signals a unique code. Code represents the microinstruction. Full encoding can reduce the length of the microinstruction even further, but this reduction comes at the expense of increasing the complexity of the decoder circuits.
212
Microprogrammed control (contd..)
Microinstruction format Vertical organization: highly encoded schemes that use compact codes to represent only a small number of control functions in each microinstruction; slower operating speed, since more microinstructions are needed to generate the required control functions. Horizontal organization: minimally encoded schemes in which many control signals can be specified in each microinstruction; higher operating speed, useful when the machine allows parallel control of resources. Intermediate schemes, in which the degree of encoding is a design parameter, are also possible.
213
Microprogrammed control (contd..)
Microprogram sequencing Simple microprogram sequencing: Load the starting address into mPC when a new instruction is loaded into IR. Introduce some branching capability within the microprogram through special branch microinstructions, which specify the branch address. Disadvantages of the simple approach: Large total number of microinstructions and large control store. Most machines have several addressing modes, and many combinations of instructions and addressing modes. Separate microroutine for each of these combinations produces a lot of duplication of common parts. Share as much common code as possible. Sharing common code requires many branch instructions to transfer control among various parts. Execution time is longer because it takes more time to carry out the required branches.
214
Microprogrammed control (contd..)
Microprogram sequencing Consider the following instruction, which adds the source operand to the contents of register Rdst and places the result in register Rdst: ADD src, Rdst. Assume that the source operand can be specified in the following addressing modes: register, autoincrement, autodecrement, indexed, and the indirect forms of the above methods.
215
Microprogrammed control (contd..)
[Figure: start of the instruction fetch; microroutines for all the instructions.] The microinstruction at address 170 is performed in the register indirect mode; the microinstruction at address 171 is performed in the register direct mode. Microinstruction 170 is bypassed if the register direct mode is used. One way to bypass microinstruction 170 is to have the previous microinstruction specify address 170, and then use an OR gate to change the LSB of the address to 1. This technique is called bit-ORing.
216
Microprogrammed control (contd..)
Microinstructions with the next-address field. Several branch microinstructions are required to enable sharing of common code. The branch microinstructions do not perform any useful operation related to data. They are required to determine the address of the next microinstruction. They slow down the execution of the instruction. Ideally we need to assign consecutive addresses to all microinstructions that are generally executed one after the other. Recall that the next microinstruction is determined by incrementing the microprogram counter. But due to the goal of sharing as much common code as possible, this is not always possible. This leads to a significant increase in the branch instructions.
217
Microprogrammed control (contd..)
Microinstructions with the next-address field. Powerful alternative is to include an address field as a part of every microinstruction. The address field indicates the location of the next microinstruction to be fetched. In effect, every microinstruction becomes a branch microinstruction in addition to its other function. Disadvantages: Additional bits are required to specify the address field in every instruction. Approximately one-sixth of the control store is devoted to specifying the address. Advantages: Separate branch instructions are virtually eliminated. Flexible scheme, very few restrictions in assigning addresses to microinstructions.
218
Microprogrammed control (contd..)
Microprogrammed control leads to a slower operating speed because of the time it takes to fetch microinstructions from the control store. To achieve faster operation, the next microinstruction can be prefetched while the current one is being executed, overlapping execution time with fetch time. Prefetching microinstructions presents some difficulties: the status flags and the results of the current microinstruction are needed to determine the address of the next microinstruction, so straightforward prefetching may occasionally fetch a wrong microinstruction, and the fetch must then be repeated. These difficulties must be weighed against the gain in operating speed.
219
Emulation Microprogrammed control offers the flexibility to add new instructions to the instruction set of a processor. New microroutines need to be added to implement the new instructions. Add to the instruction set of a given computer M1 an entirely new set of instructions that is in fact the instruction set of a different computer M2. Programs written in the machine language of M2 can be run on M1. M1 emulates M2. Emulation allows transition to new computer systems with minimal disruption.
220
Number representation
Integers are represented as binary vectors. Suppose each word consists of 32 bits, labeled b31 down to b0, where b31 is the MSB (most significant bit) and b0 is the LSB (least significant bit). The value of the binary vector interpreted as an unsigned integer is:
V(b) = b31·2^31 + b30·2^30 + ... + b1·2^1 + b0·2^0
More generally, in N bits: V(b) = b(N-1)·2^(N-1) + ... + b1·2^1 + b0·2^0
222
Number representation (contd..)
We need to represent both positive and negative integers. Three schemes are available for representing both positive and negative integers: Sign and magnitude. 1’s complement. 2’s complement. All schemes use the Most Significant Bit (MSB) to carry the sign information: If MSB = 0, bit vector represents a positive integer. If MSB = 1, bit vector represents a negative integer.
223
Number representation (contd..)
Sign and Magnitude: Lower N-1 bits represent the magnitude of the integer MSB is set to 0 or 1 to indicate positive or negative 1’s complement: Construct the corresponding positive integer (MSB = 0) Bitwise complement this integer 2’s complement: Construct the 1’s complement negative integer Add 1 to this
224
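As a quick cross-check of the three schemes, here is a small Python sketch; the function name and interface are illustrative, not part of the course material:

```python
def representations(v, n):
    """Return the n-bit sign-and-magnitude, 1's-complement, and
    2's-complement encodings of the integer v as bit strings."""
    mask = (1 << n) - 1
    if v >= 0:
        sm = oc = tc = v                  # positive: all three schemes agree
    else:
        sm = (1 << (n - 1)) | (-v)       # sign bit set, magnitude kept
        oc = ~(-v) & mask                # bitwise complement of +|v|
        tc = (oc + 1) & mask             # 1's complement plus one
    return tuple(format(x, f'0{n}b') for x in (sm, oc, tc))
```

For example, representations(-5, 4) yields the three 4-bit encodings of -5 from the table that follows.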
Number representation (contd..)
Values represented by the 4-bit vectors b3b2b1b0:

b3b2b1b0   Sign and magnitude   1's complement   2's complement
0111       +7                   +7               +7
0110       +6                   +6               +6
0101       +5                   +5               +5
0100       +4                   +4               +4
0011       +3                   +3               +3
0010       +2                   +2               +2
0001       +1                   +1               +1
0000       +0                   +0               +0
1000       -0                   -7               -8
1001       -1                   -6               -7
1010       -2                   -5               -6
1011       -3                   -4               -5
1100       -4                   -3               -4
1101       -5                   -2               -3
1110       -6                   -1               -2
1111       -7                   -0               -1
225
Number representation (contd..)
Range of numbers that can be represented in N bits:
Unsigned: 0 through 2^N - 1.
Sign and magnitude: -(2^(N-1) - 1) through +(2^(N-1) - 1); 0 has both a positive and a negative representation.
1's complement: -(2^(N-1) - 1) through +(2^(N-1) - 1); 0 has both a positive and a negative representation.
2's complement: -2^(N-1) through +(2^(N-1) - 1); 0 has a single representation, easier to add/subtract.
226
Value of a bit string in 2’s complement
How to determine the value of an integer given: the integer occupies N bits, and the 2's complement system is in effect. If the binary vector b represents a negative integer, what is V(b)?
Write b = 1 b(n-2) b(n-3) ... b1 b0
Then V(b) = -2^(n-1) + b(n-2)·2^(n-2) + b(n-3)·2^(n-3) + ... + b1·2^1 + b0·2^0
(the -2^(n-1) term is the negative part of the expression; the remaining terms form the positive part).
So, in 4 bits, the value of 1011 is V(1011) = -8 + 0 + 2 + 1 = -5.
227
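The weighting rule above is easy to check mechanically; a minimal Python sketch (the helper name is illustrative):

```python
def twos_complement_value(bits):
    """Value of a bit string in 2's complement: the MSB carries
    weight -2**(n-1); every other bit at position i carries +2**i."""
    n = len(bits)
    value = -int(bits[0]) * 2 ** (n - 1)          # sign bit, negative weight
    for pos, b in enumerate(bits[1:], start=1):
        value += int(b) * 2 ** (n - 1 - pos)      # remaining positive weights
    return value
```

Calling twos_complement_value("1011") reproduces the -8 + 2 + 1 = -5 example.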
Addition of positive numbers
Add two one-bit numbers:
0 + 0 = 00
0 + 1 = 01
1 + 0 = 01
1 + 1 = 10   (the high-order bit is the carry-out)
To add multiple-bit numbers: add bit pairs starting from the low-order end (LSB, right end of the bit vector) and propagate carries toward the high-order end (MSB, left end of the bit vector).
228
Addition and subtraction of signed numbers
We need to add and subtract both positive and negative numbers. Recall the three schemes of number representation. The sign-and-magnitude scheme is the simplest representation, but it is the most awkward for addition and subtraction operations. 2's complement is the most efficient method for performing addition and subtraction of signed numbers.
229
Addition and subtraction of signed numbers (contd..)
Consider the addition modulo 16 operation, visualized on a circle with the values 0 through 15 arranged clockwise.
Example: (7 + 4) modulo 16: locate 7 on the circle, move 4 units in the clockwise direction, and arrive at the answer 11.
Example: (9 + 14) modulo 16: locate 9 on the circle, move 14 units in the clockwise direction, and arrive at the answer 7.
[Figure: a circle labeled 0 through 15.]
230
Addition and subtraction of signed numbers (contd..)
A different interpretation of the mod-16 circle: the values 0 through 15 are represented by 4-bit binary vectors, reinterpreted as the numbers -8 through +7 in the 2's complement method.
Example: add +7 to -3.
The 2's complement representation of +7 is 0111.
The 2's complement representation of -3 is 1101.
Locate 0111 on the circle, move 1101 (13) steps clockwise, and arrive at 0100 (+4), which is the answer; the carry-out is ignored.
[Figure: the mod-16 circle relabeled with the 2's complement values -8 through +7.]
231
Rules for addition and subtraction of signed numbers in 2’s complement form
To add two numbers: Add their n-bit representations. Ignore the carry out from the MSB position. The sum is the algebraically correct value in the 2's complement representation as long as the answer is in the range -2^(n-1) through +2^(n-1) - 1.
To subtract two numbers X and Y (X - Y): Form the 2's complement of Y. Add it to X using the rule above. The result is correct as long as the answer lies in the range -2^(n-1) through +2^(n-1) - 1.
232
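The two rules can be sketched in Python, where masking to n bits plays the role of dropping the carry out of the MSB (helper names are illustrative):

```python
def add_2c(x, y, n):
    """Add two n-bit 2's complement values; the carry out of the
    MSB position is dropped (addition modulo 2**n)."""
    s = (x + y) & ((1 << n) - 1)
    return s - (1 << n) if s & (1 << (n - 1)) else s   # reinterpret as signed

def sub_2c(x, y, n):
    """X - Y: form the 2's complement of Y and add it to X."""
    neg_y = (~y + 1) & ((1 << n) - 1)
    return add_2c(x, neg_y, n)
```

Both results are correct only while the true answer stays within -2^(n-1) through +2^(n-1) - 1, exactly as the rules state.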
Overflow in integer arithmetic
When the result of an arithmetic operation is outside the representable range, an arithmetic overflow has occurred. The range is -2^(n-1) through +2^(n-1) - 1 for an n-bit vector. When adding unsigned numbers, the carry-out from the MSB position serves as the overflow indicator. When adding signed numbers, this does not work. Using 4-bit signed numbers, add +7 and +4: the result is 1011, which represents -5. Using 4-bit signed numbers, add -4 and -6: the result is 0110, which represents +6.
233
Overflow in integer arithmetic (contd..)
Overflow can occur only when both numbers have the same sign; addition of numbers with different signs cannot cause an overflow. The carry-out signal from the MSB (sign-bit) position is not a sufficient indicator of overflow when adding signed numbers. To detect overflow when adding X and Y: examine the signs of X and Y, and examine the sign of the result S. When X and Y have the same sign, and the sign of the result differs from the signs of X and Y, overflow has occurred.
234
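The sign-comparison rule translates directly into a short Python check (the function name is illustrative):

```python
def add_overflows(x, y, n):
    """Signed-overflow test: the operands share a sign and the
    n-bit sum's sign differs from it."""
    s = (x + y) & ((1 << n) - 1)      # n-bit sum, carry out of MSB dropped
    sx = (x >> (n - 1)) & 1           # sign of X
    sy = (y >> (n - 1)) & 1           # sign of Y
    ss = (s >> (n - 1)) & 1           # sign of the result
    return sx == sy and sx != ss
```

The two 4-bit examples from the slide, +7 + 4 and -4 + -6, both trip the detector, while mixed-sign additions never do.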
Addition/subtraction of signed numbers
At the ith stage:
Inputs: xi, yi, and the carry-in ci.
Outputs: the sum si and the carry-out ci+1 to the (i+1)st stage.
si = xi'yi'ci + xi'yi ci' + xi yi'ci' + xi yi ci = xi ⊕ yi ⊕ ci
ci+1 = yi ci + xi ci + xi yi
Example:
  X =  7 = 0111
+ Y = +6 = 0110
  Z = 13 = 1101
with the carry-out of each stage becoming the carry-in of the next.
235
Addition logic for a single stage
[Figure: sum and carry logic for one stage, with inputs xi, yi, ci and outputs si, ci+1.] Full Adder (FA): symbol for the complete circuit for a single stage of addition.
236
n-bit adder Cascade n full adder (FA) blocks to form an n-bit adder.
Carries propagate, or ripple, through this cascade, so it is called an n-bit ripple-carry adder. [Figure: n cascaded FA stages, from the LSB position to the MSB position.] The carry-in c0 into the LSB position provides a convenient way to perform subtraction.
237
kn-bit adder kn-bit numbers can be added by cascading k n-bit adders.
[Figure: k cascaded n-bit adder blocks, with the carry-out of each block feeding the carry-in of the next.] Each n-bit adder forms a block, so this is a cascading of blocks. Carries ripple, or propagate, through the blocks: a blocked ripple-carry adder.
238
n-bit subtractor Recall that X - Y is equivalent to adding the 2's complement of Y to X, and the 2's complement is equivalent to the 1's complement plus 1:
X - Y = X + Y' + 1
where Y' is the bitwise (1's) complement of Y. The 2's complement of positive and negative numbers is computed in the same way.
239
n-bit adder/subtractor
[Figure: n-bit ripple-carry adder/subtractor built from FA stages, from the LSB position to the MSB position.] Adder inputs: xi, yi, with c0 = 0. Subtractor inputs: xi, the complemented yi, with c0 = 1.
240
n-bit adder/subtractor (contd..)
[Figure: n-bit adder with an Add/Sub control line; each yi passes through an XOR gate controlled by Add/Sub, and the same control line feeds c0.] Add/Sub control = 0: addition. Add/Sub control = 1: subtraction.
241
Detecting overflows Overflows can only occur when the sign of the two operands is the same; overflow occurs if the sign of the result is different from the sign of the operands. Recall that the MSB represents the sign: xn-1, yn-1, and sn-1 represent the signs of operand X, operand Y, and result S, respectively. A circuit to detect overflow can be implemented by the following logic expression:
Overflow = xn-1 yn-1 sn-1' + xn-1' yn-1' sn-1
242
Computing the add time Consider the 0th stage, a full adder with inputs x0, y0, and c0: s0 is available after 1 gate delay, and c1 is available after 2 gate delays (the sum comes from one XOR level, the carry from a two-level AND-OR circuit).
243
Computing the add time (contd..)
Cascade of 4 full adders, or a 4-bit adder:
s0 available after 1 gate delay; c1 available after 2 gate delays.
s1 available after 3 gate delays; c2 available after 4 gate delays.
s2 available after 5 gate delays; c3 available after 6 gate delays.
s3 available after 7 gate delays; c4 available after 8 gate delays.
For an n-bit ripple-carry adder, sn-1 is available after 2n - 1 gate delays and cn is available after 2n gate delays.
244
Fast addition Recall the equations:
si = xi ⊕ yi ⊕ ci
ci+1 = xi yi + xi ci + yi ci
The second equation can be written as:
ci+1 = xi yi + (xi + yi) ci = Gi + Pi ci
where Gi = xi yi is called the generate function and Pi = xi + yi is called the propagate function. Gi and Pi are computed only from xi and yi and not ci; thus they can be computed in one gate delay after X and Y are applied to the inputs of an n-bit adder.
245
Carry lookahead All carries can be obtained 3 gate delays after X, Y and c0 are applied: one gate delay for Pi and Gi, and two gate delays in the AND-OR circuit for ci+1. All sums can be obtained 1 gate delay after the carries are computed. Independent of n, n-bit addition then requires only 4 gate delays. This is called a carry-lookahead adder.
246
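The generate/propagate recurrence can be exercised in Python; this sketch iterates ci+1 = Gi + Pi·ci bit by bit (the hardware computes the fully expanded products in parallel, which is where the fixed gate-delay count comes from):

```python
def cla_add(x, y, n, c0=0):
    """Add two n-bit values forming carries from generate/propagate:
    G_i = x_i AND y_i, P_i = x_i OR y_i, c_{i+1} = G_i + P_i * c_i."""
    s, c = 0, c0
    for i in range(n):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        g, p = xi & yi, xi | yi          # generate and propagate functions
        s |= (xi ^ yi ^ c) << i          # sum bit needs only c_i
        c = g | (p & c)                  # lookahead recurrence
    return s, c
```

The result matches a ripple-carry adder bit for bit; only the hardware timing differs.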
Carry-lookahead adder
[Figure: 4-bit carry-lookahead adder built from four B cells plus carry-lookahead logic. Each B cell takes xi, yi, and ci and produces si, Gi, and Pi; the lookahead logic produces c1 through c4.]
247
Carry lookahead adder (contd..)
Performing n-bit addition in 4 gate delays independent of n is good only theoretically, because of fan-in constraints. Expanding the carry recurrence:
ci+1 = Gi + Pi Gi-1 + Pi Pi-1 Gi-2 + ... + Pi Pi-1 ... P1 G0 + Pi Pi-1 ... P0 c0
The last AND gate and the OR gate require a fan-in of (n + 1) for an n-bit adder. For a 4-bit adder (n = 4), a fan-in of 5 is required, which is about the practical limit for most gates. In order to add operands longer than 4 bits, we can cascade 4-bit carry-lookahead adders. A cascade of carry-lookahead adders is called a blocked carry-lookahead adder.
248
Blocked Carry-Lookahead adder
The carry-out from a 4-bit block can be given as:
c4 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 + P3 P2 P1 P0 c0
Rewrite this as:
c4 = G0^I + P0^I c0
where G0^I = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 and P0^I = P3 P2 P1 P0. The superscript I denotes the blocked carry lookahead and identifies the block. Cascading four 4-bit adders, c16 can be expressed as:
c16 = G3^I + P3^I G2^I + P3^I P2^I G1^I + P3^I P2^I P1^I G0^I + P3^I P2^I P1^I P0^I c0
249
Blocked Carry-Lookahead adder (contd..)
[Figure: 16-bit blocked carry-lookahead adder: four 4-bit adders producing s3-0, s7-4, s11-8, s15-12, plus second-level carry-lookahead logic producing c4, c8, c12, and c16.]
After xi, yi and c0 are applied as inputs:
- Gi and Pi for each stage are available after 1 gate delay.
- P^I is available after 2 gate delays and G^I after 3 gate delays.
- All carries are available after 5 gate delays; in particular, c16 is available after 5 gate delays.
- s15, which depends on c12, is available after 8 (5 + 3) gate delays. (Recall that for a 4-bit carry-lookahead adder, the last sum bit is available 3 gate delays after all inputs are available.)
250
Multiplication of unsigned numbers
        1101    (13)   Multiplicand M
      × 1011    (11)   Multiplier Q
        1101           Partial product (PP) #1
       1101            Partial product (PP) #2
      0000             Partial product (PP) #3
     1101              Partial product (PP) #4
    10001111    (143)  Product P
The product of two n-bit numbers is at most a 2n-bit number, so we should expect to store a double-length result. Unsigned multiplication can be viewed as addition of shifted versions of the multiplicand.
251
Multiplication of unsigned numbers (contd..)
We added the partial products at the end. An alternative is to add the partial products at each stage. The rules to implement multiplication are: If the ith bit of the multiplier is 1, shift the multiplicand and add the shifted multiplicand to the current value of the partial product. Hand over the partial product to the next stage. The value of the partial product at the start is 0.
252
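The shift-and-add rule above can be sketched in a few lines of Python (illustrative helper name):

```python
def shift_add_multiply(m, q, n):
    """Unsigned multiplication as repeated addition of shifted
    versions of the multiplicand m, one per 1 bit of multiplier q."""
    pp = 0                        # partial product, up to 2n bits wide
    for i in range(n):
        if (q >> i) & 1:          # is the ith multiplier bit set?
            pp += m << i          # add the multiplicand shifted left i places
    return pp
```

Running it on the slide's example, shift_add_multiply(13, 11, 4), reproduces the product 143.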
Multiplication of unsigned numbers (contd..)
Typical multiplication cell: [Figure: a full adder (FA) combines a bit of the incoming partial product PPi with the AND of the jth multiplicand bit and the ith multiplier bit, together with a carry-in, producing a bit of the outgoing partial product PP(i+1) and a carry-out.]
253
Combinatorial array multiplier
[Figure: combinatorial array multiplier. The multiplicand bits m3...m0 feed rows of multiplication cells, multiplier bits q3...q0 gate the rows, and partial products PP0 through PP3 ripple through the array; the product is p7, p6, ..., p0.] The multiplicand is shifted by displacing it through the array of adders.
254
Combinatorial array multiplier (contd..)
Combinatorial array multipliers are: Extremely inefficient. Have a high gate count for multiplying numbers of practical size such as 32-bit or 64-bit numbers. Perform only one function, namely, unsigned integer product. Improve gate efficiency by using a mixture of combinatorial array techniques and sequential techniques requiring less combinational logic.
255
Sequential multiplication
Recall the rule for generating partial products: If the ith bit of the multiplier is 1, add the appropriately shifted multiplicand to the current partial product. Multiplicand has been shifted left when added to the partial product. However, adding a left-shifted multiplicand to an unshifted partial product is equivalent to adding an unshifted multiplicand to a right-shifted partial product.
256
Sequential multiplication (contd..)
Load register A with 0. Registers are used to store the multiplier (Q) and the multiplicand (M). Each cycle, repeat the following steps:
1. If the LSB q0 = 1: add the multiplicand to A and store the carry-out in flip-flop C. Else (q0 = 0): do not add.
2. Shift the contents of C, A, and Q right one bit, and discard q0.
[Figure: sequential multiplier datapath with registers C, A (initially 0), Q, and M, an n-bit adder, a MUX with an Add/Noadd control, and a control sequencer.]
257
Sequential multiplication (contd..)
[Figure: sequential multiplication example. Starting from the initial configuration (C = 0, A = 0, Q = multiplier), each of the four cycles performs an add or no-add followed by a shift right; after the fourth cycle the double-length product is held in A and Q.]
258
Multiplication of signed-operands
Recall we discussed multiplication of unsigned numbers: Combinatorial array multiplier. Sequential multiplier. Need an approach that works uniformly with unsigned and signed (positive and negative 2’s complement) n-bit operands. Booth’s algorithm treats positive and negative 2’s complement operands uniformly.
260
Booth’s algorithm Booth’s algorithm applies uniformly to both unsigned and 2’s complement signed integers, and is the basis of other fast multiplication algorithms. Fundamental observation: an integer can be divided into a sum of “block-1” integers, that is, integers whose binary representation contains a single contiguous block of 1s. A 16-bit binary number with four runs of consecutive 1s, for example, can be represented as the sum of 4 block-1 integers.
261
Booth’s algorithm (contd..)
Suppose Q is a block-1 integer:
Q = 01111000 = 120
Then X·Q = X·120. Now 120 = 128 - 8, so that
X·Q = X·120 = X·(128 - 8) = X·128 - X·8
If we label the LSB as bit 0, then the first 1 in the block of 1's is at position 3 and the last 1 in the block is at position 6. As a result:
X·Q = X·120 = X·128 - X·8 = X·2^7 - X·2^3
262
Booth’s algorithm (contd..)
Representing block-1 integers Let Q be an n-bit block-1 unsigned integer, with bit position 0 as the LSB, the first 1 in bit position j, and the last 1 in bit position k. Then:
Q = 2^(k+1) - 2^j
Q·X = X·(2^(k+1) - 2^j) = X·2^(k+1) - X·2^j
263
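The closed form for a block-1 integer is a one-liner worth checking (illustrative helper name):

```python
def block1_value(j, k):
    """Value of a bit vector whose 1s occupy positions j..k inclusive,
    with the LSB at position 0: 2**(k+1) - 2**j."""
    return (1 << (k + 1)) - (1 << j)
```

With j = 3 and k = 6 this recovers the earlier example, 01111000 = 120.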
Booth’s algorithm (contd..)
Let Q be the block-1 integer:
Q = 0111111111111110 = 2^15 - 2^1
To form the product X·Q using normal multiplication would involve 14 add/shifts (one for each 1-valued bit in multiplier Q). Since Q = 2^15 - 2^1, X·Q = X·(2^15 - 2^1), and the product can be computed as follows:
1. Set the partial product (PP) to 0.
2. Subtract X·2^1 from PP.
3. Add X·2^15 to PP.
Note that X·2^j is equivalent to shifting X left j times.
264
Booth’s algorithm (contd..)
If Q is not a block-1 integer, Q can be decomposed so that it is represented as a sum of block-1 integers, one per run of consecutive 1s, each of the form 2^(k+1) - 2^j. Then Q·X is the sum of the corresponding terms X·(2^(k+1) - 2^j), one addition and one subtraction per block.
265
Booth’s algorithm (contd..)
Inputs: an n-bit multiplier Q, an n-bit multiplicand X, and a 2n-bit current partial product (PP), initially set to 0. (The upper half of PP is bits 2n-1 through n.) Q has an added 0 bit attached to the right of its LSB (so Q effectively has n+1 bits).
Algorithm: for every bit in Q:
1. Examine the bit and its neighbor to the immediate right. If the bit pair is:
00: do nothing.
01: add the multiplicand to the upper half of PP.
10: subtract the multiplicand from the upper half of PP.
11: do nothing.
2. Shift the PP right by one bit, extending the sign bit.
266
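The bit-pair rules can be modeled at the register level in Python; this sketch keeps the upper half of PP in A and the multiplier in Q, with q_1 as the appended bit (variable names are illustrative):

```python
def booth_multiply(x, y, n=8):
    """Multiply two n-bit 2's complement integers with Booth's algorithm."""
    mask = (1 << n) - 1
    A, Q, q_1 = 0, y & mask, 0        # A:Q is the 2n-bit partial product
    M = x & mask                      # multiplicand in n-bit form
    for _ in range(n):
        pair = ((Q & 1) << 1) | q_1   # current bit and its right neighbor
        if pair == 0b01:
            A = (A + M) & mask        # 01: add multiplicand to upper half
        elif pair == 0b10:
            A = (A - M) & mask        # 10: subtract multiplicand
        # arithmetic shift right of A:Q:q_1, extending the sign bit of A
        q_1 = Q & 1
        Q = ((Q >> 1) | ((A & 1) << (n - 1))) & mask
        A = ((A >> 1) | (A & (1 << (n - 1)))) & mask
    result = (A << n) | Q
    if result & (1 << (2 * n - 1)):   # reinterpret as 2n-bit 2's complement
        result -= 1 << (2 * n)
    return result
```

The same code handles every sign combination, which is the point of Booth's algorithm.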
Booth’s algorithm (contd..) – Example #1
Multiplicand X, multiplier Q, and a double-length PP register. Each step examines a bit pair, performs the indicated action, and shifts the PP right:
Step 0: bit pair 10, action SUB, shift right.
Step 1: bit pair 11, no action, shift right.
Step 2: bit pair 11, no action, shift right.
Step 3: bit pair 01, action ADD, shift right.
Step 4: bit pair 10, action SUB, shift right.
Step 5: bit pair 11, no action, shift right.
Step 6: bit pair 11, no action, shift right.
Step 7: bit pair 01, action ADD, shift right.
The final contents of the PP register are the product.
267
Booth’s algorithm (contd..) – Example #2
Multiplicand in binary: -119 (a 2's complement negative number). Multiplier in binary: 76 (positive).
Step 0: bit pair 10, action SUB, shift right.
Step 1: bit pair 11, no action, shift right.
Step 2: bit pair 11, no action, shift right.
Step 3: bit pair 01, action ADD, shift right.
Step 4: bit pair 10, action SUB, shift right.
Step 5: bit pair 11, no action, shift right.
Step 6: bit pair 11, no action, shift right.
Step 7: bit pair 01, action ADD, shift right.
The final contents of the PP register are the product.
268
Booth’s algorithm (contd..) – Example #3
Multiplicand in binary: -119 (2's complement negative). Multiplier in binary: -76 (2's complement negative).
Step 0: bit pair 10, action SUB, shift right.
Step 1: bit pair 01, action ADD, shift right.
Step 2: bit pair 00, no action, shift right.
Step 3: bit pair 10, action SUB, shift right.
Step 4: bit pair 01, action ADD, shift right.
Step 5: bit pair 00, no action, shift right.
Step 6: bit pair 00, no action, shift right.
Step 7: bit pair 10, action SUB, shift right.
The final contents of the PP register are the product.
269
Booth’s algorithm and signed integers
Booth’s algorithm works correctly for any combination of 2’s complement numbers. If Q is a 2’s complement negative number, then Q is of the form:
Q = 1 q(n-2) q(n-3) ... q1 q0
And:
V(Q) = -2^(n-1) + q(n-2)·2^(n-2) + q(n-3)·2^(n-3) + ... + q1·2^1 + q0·2^0
If Q has no zeros, that is:
Q = 111...1 = -1
the Booth technique will see the first pair as 10 (subtract the multiplicand) and all other pairs as 11 (do nothing). Thus the Booth technique computes the result as required:
0 - X = -X = (-1)·X
270
Booth’s algorithm and signed integers (contd..)
Booth’s technique is correct for an arbitrary negative 2’s complement number. Let Q be a 2’s complement number:
Q = 1 q(n-2) q(n-3) ... q1 q0
V(Q) = -2^(n-1) + q(n-2)·2^(n-2) + q(n-3)·2^(n-3) + ... + q1·2^1 + q0·2^0
Reading Q from the MSB (left-hand side), Q has a certain number of 1s followed by a 0. Suppose the first 0 appears in bit position m of Q. Then:
V(Q) = (-2^(n-1) + 2^(n-2) + ... + 2^(m+1)) + (q(m-1)·2^(m-1) + ... + q1·2^1 + q0·2^0)
     = -2^(m+1) + (q(m-1)·2^(m-1) + ... + q1·2^1 + q0·2^0)
since -2^(n-1) + 2^(n-2) + ... + 2^(m+1) = -2^(m+1). Bit m is 0, so the number formed by bits m-1 through 0 is a positive number, which Booth’s algorithm handles correctly. Transitioning from bit m to bit m+1, the algorithm sees the bit pair 10, causing it to subtract the multiplicand weighted by 2^(m+1) from the PP. All other transitions within the run of 1s see pairs 11, which have no effect.
271
Unsigned division Division is a more tedious process than multiplication. For the unsigned case, there are two standard approaches: 1) restoring division and 2) non-restoring division.
Decimal example: 274 ÷ 13 = 21 remainder 1. Try dividing 13 into 2 (it does not go), then into 27 (it goes twice: 26); subtract and bring down the 4 to get 14, then divide 13 into 14 (it goes once).
Binary example: 100010010 ÷ 1101 = 10101 remainder 1. Try dividing 1101 into 1, 10, 100, 1000, and 10001.
272
Restoring division How do we know when the divisor has gone into part of the dividend correctly? Consider 100010010 ÷ 1101:
Subtract 1101 from 1: the result is negative.
Subtract 1101 from 10: the result is negative.
Subtract 1101 from 100: the result is negative.
Subtract 1101 from 1000: the result is negative.
Subtract 1101 from 10001: the result is positive, so the divisor goes.
273
Restoring division Strategy for unsigned division:
Shift the dividend one bit at a time starting from MSB into a register. Subtract the divisor from this register. If the result is negative (“didn’t go”): - Add the divisor back into the register. - Record 0 into the result register. If the result is positive: - Do not restore the intermediate result. - Set a 1 into the result register.
274
Restoring division (contd..)
Set register A to 0. Load the dividend into Q and the divisor into M. Repeat n times:
- Shift A and Q left one bit.
- Subtract M from A and place the result in A.
- If the sign of A is 1 (negative), set q0 to 0 and add M back to A; else set q0 to 1.
At the end of the process, the quotient is in Q and the remainder is in A.
[Figure: restoring division datapath with registers A, Q (dividend/quotient setting), M (divisor), an (n+1)-bit adder with Add/Subtract control, and a control sequencer; the sign bit of A is the result of the subtraction.]
275
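The register-level steps above can be sketched in Python (illustrative names; a, q, m mirror the A, Q, M registers):

```python
def restoring_divide(dividend, divisor, n):
    """Unsigned restoring division: n cycles of shift-left,
    trial subtraction, and (on a negative result) restore."""
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        a = (a << 1) | ((q >> (n - 1)) & 1)   # shift A:Q left one bit
        q = (q << 1) & ((1 << n) - 1)
        a -= m                                 # trial subtraction
        if a < 0:
            a += m                             # "didn't go": restore, q0 = 0
        else:
            q |= 1                             # "went": q0 = 1
    return q, a                                # (quotient, remainder)
```

Running it on the 9-bit example 274 ÷ 13 yields quotient 21 and remainder 1.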
Restoring division (contd..)
[Figure: restoring division example. Each of the four cycles shifts A and Q left, subtracts M, then either restores A and sets q0 = 0 or sets q0 = 1; at the end Q holds the quotient and A the remainder.]
276
Non-restoring division
Restoring division can be improved using the non-restoring algorithm. The effect of the restoring algorithm is: If A is positive, we shift it left and subtract M, that is, compute 2A - M. If A is negative, we restore it (A + M), shift it left, and subtract M, that is, 2(A + M) - M = 2A + M. So the restore step can be folded into the next cycle. The non-restoring algorithm is:
Set A to 0. Repeat n times:
- If the sign of A is positive (0): shift A and Q left and subtract M.
- Else (the sign of A is negative): shift A and Q left and add M.
- Then set q0 to 1 if the new sign of A is positive, and to 0 if it is negative.
Finally, if the sign of A is 1 (negative), add M to A to restore the true remainder.
277
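The folded add-or-subtract loop can be sketched the same way as the restoring version (illustrative names):

```python
def nonrestoring_divide(dividend, divisor, n):
    """Unsigned non-restoring division: one add or one subtract per
    cycle, with a single corrective restore at the end."""
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        msb = (q >> (n - 1)) & 1
        q = (q << 1) & ((1 << n) - 1)
        if a >= 0:
            a = ((a << 1) | msb) - m       # sign positive: shift, subtract
        else:
            a = ((a << 1) | msb) + m       # sign negative: shift, add
        if a >= 0:
            q |= 1                         # q0 = 1 when the new A is positive
    if a < 0:
        a += m                             # final restore of the remainder
    return q, a                            # (quotient, remainder)
```

It produces the same quotient and remainder as the restoring algorithm while doing only one add or subtract per cycle.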
Non-restoring division (contd..)
[Figure: non-restoring division example. Each of the four cycles shifts and then adds or subtracts M according to the sign of A, setting q0 from the resulting sign; a final addition of M restores the remainder when A ends up negative.]
278
Fractions If b is a binary vector, then we have seen that it can be interpreted as an unsigned integer by:
V(b) = b31·2^31 + b30·2^30 + ... + b1·2^1 + b0·2^0
This vector has an implicit binary point to its immediate right:
b31 b30 ... b1 b0 .   (implicit binary point)
Suppose instead the binary vector is interpreted with the implicit binary point just to the left of the MSB:
. b31 b30 ... b1 b0   (implicit binary point)
The value of b is then given by:
V(b) = b31·2^-1 + b30·2^-2 + ... + b1·2^-31 + b0·2^-32
280
Range of fractions The value of the unsigned binary fraction is:
V(b) = b31·2^-1 + b30·2^-2 + ... + b1·2^-31 + b0·2^-32
The range of the numbers represented in this format is:
0 ≤ V(b) ≤ 1 - 2^-32
In general, for an n-bit binary fraction (a number with an assumed binary point at the immediate left of the vector), the range of values is:
0 ≤ V(b) ≤ 1 - 2^-n
281
Scientific notation Previous representations have a fixed point: either the point is to the immediate right or to the immediate left of the vector. This is called fixed-point representation. Fixed-point representation suffers from the drawback that it can represent only a finite (and quite small) range of numbers. A more convenient representation is scientific notation, where numbers are represented in the form:
x = ±m × b^e
The components of these numbers are the mantissa (m), the implied base (b), and the exponent (e). Avogadro's number, for example, is 6.022 × 10^23.
282
Significant digits A number such as 0.9999999 is said to have 7 significant digits. Fractions in the range 0.0 to 0.9999999 need about 24 bits of precision in binary. For example, the binary fraction with 24 1's is:
0.111111111111111111111111 = 0.99999994...
Not every real number in this range can be represented by a 24-bit fractional number. The smallest non-zero number that can be represented is:
0.000000000000000000000001 = 2^-24 ≈ 5.96 × 10^-8
Every other non-zero number is constructed in increments of this value.
283
Sign and exponent digits
In a 32-bit number, suppose we allocate 24 bits to represent a fractional mantissa. Assume that the mantissa is represented in sign-and-magnitude format, with one bit allocated to the sign. We allocate 7 bits to the exponent, represented as a 2's complement integer. No bits are allocated to the base; we assume an implied base of 2. Since a 7-bit 2's complement number can represent values in the range -64 to 63, the range of numbers that can be represented is approximately:
2^-64 ≤ |x| ≤ (1 - 2^-24) × 2^63
In decimal, this range is approximately:
5.4 × 10^-20 ≤ |x| ≤ 9.2 × 10^18
284
A sample representation
[Figure: a sample 32-bit representation: 1 sign bit, a 7-bit exponent in 2's complement form (implied base 2), and a 24-bit fractional mantissa with an implied binary point at its immediate left.]
285
Normalization Consider the number: x = 0.0004056781 × 10^12
If the number is to be represented using only 7 significant mantissa digits, the representation, ignoring rounding, is:
x = 0.0004056 × 10^12
If the number is instead shifted so that as many significant digits as possible are brought into the 7 available slots:
x = 0.4056781 × 10^9 = 0.0004056781 × 10^12
The exponent of x is decreased by 1 for every left shift of the mantissa. A number brought into a form in which all of the available mantissa digits are optimally used (which is different from all being occupied, which may not hold) is called a normalized number. The same methodology holds in the case of binary mantissas:
(0.00010110) × 2^8 = (0.10110) × 2^5
286
Normalization (contd..)
A floating-point number is in normalized form if the most significant 1 in the mantissa is in the most significant bit of the mantissa. All normalized floating-point numbers in this system are of the form:
0.1xxxxx...xx
If every number must be normalized, the range of numbers representable in this system is:
0.5 × 2^-64 ≤ |x| < 1 × 2^63
287
Normalization, overflow and underflow
The procedure for normalizing a floating-point number is:
do (until the MSB of the mantissa == 1)
    shift the mantissa left (or right)
    decrement (or increment) the exponent by 1
end do
Applying the normalization procedure to a number of the form m × 2^-62 can drive the exponent down to -65; but we cannot represent an exponent of -65, so in trying to normalize the number we have underflowed the representation. Similarly, normalizing a number of the form m × 2^63 can require a right shift, giving an exponent of 64; this overflows the representation.
288
Changing the implied base
So far we have assumed an implied base of 2, that is, our floating point numbers are of the form: x = m x 2^e. If we choose an implied base of 16, then: x = m x 16^e. Then: y = (m x 16) x 16^(e-1) = (m x 2^4) x 16^(e-1) = m x 16^e = x. Thus, every four left shifts of a binary mantissa result in a decrease of 1 in a base-16 exponent. Normalization in this case means shifting the mantissa until there is a 1 in the first four bits of the mantissa.
289
Excess notation Rather than representing an exponent in 2's complement form, it turns out to be more beneficial to represent the exponent in excess notation. If 7 bits are allocated to the exponent, exponents can be represented in the range of -64 to +63, that is: -64 <= e <= 63. The exponent can also be represented using a coding called excess-64: E' = Etrue + 64. In general, excess-p coding is: E' = Etrue + p. A true exponent of -64 is represented as 0, 0 is represented as 64, and 63 is represented as 127. This enables efficient comparison of the relative sizes of two floating point numbers.
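The mapping and the comparison benefit can be shown in a few lines; a minimal sketch (the helper name is mine):

```python
# Excess-p coding of exponents: stored code E' = E_true + p (here p = 64).
def to_excess(e_true, p=64):
    assert -p <= e_true <= p - 1, "exponent out of range"
    return e_true + p

codes = [to_excess(e) for e in (-64, 0, 63)]
print(codes)  # [0, 64, 127]

# Unsigned codes preserve numeric order, so exponents compare like integers:
print(to_excess(-3) < to_excess(5))  # True
```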
290
IEEE notation IEEE Floating Point notation is the standard representation in use. There are two representations: single precision and double precision. Both have an implied base of 2. Single precision: 32 bits (23-bit mantissa, 8-bit exponent in excess-127 representation). Double precision: 64 bits (52-bit mantissa, 11-bit exponent in excess-1023 representation). Fractional mantissa, with an implied binary point at the immediate left. Layout: Sign (1 bit) | Exponent (8 or 11 bits) | Mantissa (23 or 52 bits)
291
Peculiarities of IEEE notation
Floating point numbers have to be represented in normalized form to maximize the use of the available mantissa digits. In a base-2 representation, this implies that the MSB of the mantissa is always equal to 1. If every number is normalized, then the MSB of the mantissa is always 1, and we can do away with storing it. IEEE notation assumes that all numbers are normalized, so the MSB of the mantissa is an implied 1 that is not stored. The values of the numbers represented in the IEEE single precision notation are of the form: (+,-) 1.M x 2^(E - 127). The hidden 1 forms the integer part of the mantissa. Note that excess-127 and excess-1023 (not excess-128 or excess-1024) are used to represent the exponent.
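The field layout can be inspected for real values using Python's `struct` module. A sketch that holds for normal (normalized) numbers only; the function name is mine:

```python
import struct

# Decode the fields of an IEEE single-precision value: 1 sign bit, 8-bit
# excess-127 exponent, 23 stored mantissa bits with a hidden leading 1.
def decode_ieee_single(x):
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exp_field = (bits >> 23) & 0xFF        # stored, excess-127 exponent
    frac = bits & 0x7FFFFF                 # 23 explicit mantissa bits
    significand = 1 + frac / 2**23         # hidden 1 restored: 1.M
    return sign, exp_field - 127, significand

# 6.5 = +1.625 x 2^2:
print(decode_ieee_single(6.5))  # (0, 2, 1.625)
```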
292
Exponent field In the IEEE representation, the exponent is in excess-127 (excess-1023) notation. The actual exponents represented are: -126 <= E <= 127 and -1022 <= E <= 1023, not -127 <= E <= 128 and -1023 <= E <= 1024. This is because IEEE reserves the exponents -127 and 128 (and -1023 and 1024), that is, the stored values 0 and 255 (0 and 2047), to represent special conditions: - Exact zero - Infinity
293
Floating point arithmetic
Addition: a x 10^6 + b x 10^8 = (a x 10^-2) x 10^8 + b x 10^8 = (a x 10^-2 + b) x 10^8. Multiplication: (a x 10^8) x (1.19 x 10^6) = (a x 1.19) x 10^(8+6). Division: (a x 10^8) / (1.19 x 10^6) = (a / 1.19) x 10^(8-6). Biased exponent problem: suppose a true exponent e is represented in excess-p notation, that is, as e+p. Then consider what happens under multiplication: a.10^(x+p) * b.10^(y+p) = (a.b).10^(x+p+y+p) = (a.b).10^(x+y+2p). Representing the result in excess-p notation implies that the exponent should be x+y+p; instead it is x+y+2p. Biases must be handled in floating point arithmetic.
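The bias bookkeeping above can be sketched for excess-127 exponents; the helper name and the example exponents are mine:

```python
# Multiplying two numbers whose exponents are stored in excess-127 form:
# (e1 + 127) + (e2 + 127) = e1 + e2 + 254, so one bias must be subtracted
# to leave the result in excess-127 form.
BIAS = 127

def mul_biased_exponents(E1, E2):
    return E1 + E2 - BIAS

# True exponents 3 and 4 are stored as 130 and 131; the product's exponent
# 7 should be stored as 7 + 127 = 134:
print(mul_biased_exponents(130, 131))  # 134
```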
294
Floating point arithmetic: ADD/SUB rule
Choose the number with the smaller exponent. Shift its mantissa right until the exponents of both the numbers are equal. Add or subtract the mantissas. Determine the sign of the result. Normalize the result if necessary and truncate/round to the number of mantissa bits. Note: This does not consider the possibility of overflow/underflow.
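The steps above can be sketched as a toy model using decimal mantissas with 0.1 <= |m| < 1; this is an illustrative sketch, not the hardware algorithm, and the function name is mine:

```python
# Toy model of the ADD rule: shift the smaller-exponent mantissa right,
# add, then renormalize and round to a fixed number of digits.
def fp_add(m1, e1, m2, e2, digits=7):
    if e1 < e2:                       # make operand 1 the larger exponent
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 /= 10 ** (e1 - e2)             # right shift = divide by base^difference
    m = m1 + m2
    while abs(m) >= 1:                # renormalize if the sum overflowed
        m /= 10
        e1 += 1
    while 0 < abs(m) < 0.1:           # or if leading digits cancelled
        m *= 10
        e1 -= 1
    return round(m, digits), e1

# 0.5 x 10^3 + 0.5 x 10^1 = (0.5 + 0.005) x 10^3:
print(fp_add(0.5, 3, 0.5, 1))
```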
295
Floating point arithmetic: MUL rule
Add the exponents. Subtract the bias. Multiply the mantissas and determine the sign of the result. Normalize the result (if necessary). Truncate/round the mantissa of the result.
296
Floating point arithmetic: DIV rule
Subtract the exponents and add the bias. Divide the mantissas and determine the sign of the result. Normalize the result if necessary. Truncate/round the mantissa of the result. Note: multiplication and division do not require alignment of the mantissas the way addition and subtraction do.
297
Guard bits While adding two floating point numbers with 24-bit mantissas, we shift the mantissa of the number with the smaller exponent to the right until the two exponents are equalized. This implies that mantissa bits may be lost during the right shift (that is, bits of precision may be shifted out of the mantissa being shifted). To prevent this, floating point operations are implemented by keeping guard bits, that is, extra bits of precision at the least significant end of the mantissa. The arithmetic on the mantissas is performed with these extra bits of precision. After an arithmetic operation, the guarded mantissas are: - Normalized (if necessary) - Converted back by a process called truncation/rounding to a 24-bit mantissa.
298
Truncation/rounding
Straight chopping: the guard bits (excess bits of precision) are simply dropped. Von Neumann rounding: if the guard bits are all 0, they are dropped; however, if any of the guard bits is a 1, then the LSB of the retained bits is set to 1. Rounding: if there is a 1 in the MSB of the guard bits, then a 1 is added to the LSB of the retained bits.
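The three methods can be sketched on an integer holding mantissa plus guard bits; the helper names and the 12-bit example value are mine:

```python
# Truncation methods on an integer holding mantissa + guard bits
# (keep = number of retained bits, total = all bits).
def chop(m, keep, total):
    return m >> (total - keep)                 # drop the guard bits

def von_neumann(m, keep, total):
    guard = m & ((1 << (total - keep)) - 1)
    r = m >> (total - keep)
    return r | 1 if guard else r               # any guard 1: force LSB to 1

def round_near(m, keep, total):
    g = total - keep
    r = m >> g
    if (m >> (g - 1)) & 1:                     # MSB of the guard bits is 1
        r += 1                                 # add 1 to the retained LSB
    return r

m = 0b101101_100001                            # 6 mantissa bits, 6 guard bits
print(chop(m, 6, 12), von_neumann(m, 6, 12), round_near(m, 6, 12))  # 45 45 46
```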
299
Errors Suppose we have a 6-bit mantissa and 6 guard bits. Bits b1 through b6 are mantissa bits, and bits b7 through b12 are guard bits: 0.b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12. In order to truncate the 12-bit mantissa to a 6-bit mantissa, any of the three methods above can be used. Straight chopping: all errors are positive. The truncated 6-bit fraction is a smaller (or equal) number than the 12-bit guarded number. This process thus involves biased errors.
300
Errors (contd..) Von Neumann rounding:
Errors are evenly distributed about 0. For example: 0.b1 b2 b3 b4 b5 0 100000 is rounded to 0.b1 b2 b3 b4 b5 1, a larger number than the guarded value, while 0.b1 b2 b3 b4 b5 1 100000 is rounded to 0.b1 b2 b3 b4 b5 1, a smaller number than the guarded value.
301
Rounding 0.b1 b2 b3 b4 b5 b6 1 b8 b9 b10 b11 b12 rounds to 0.b1 b2 b3 b4 b5 b6 + 0.000001, and 0.b1 b2 b3 b4 b5 b6 0 b8 b9 b10 b11 b12 rounds to 0.b1 b2 b3 b4 b5 b6. Rounding corresponds to the phrase "round to the nearest number". Errors are symmetrical about 0 and smaller than the errors in the other cases. Consider the number 0.b1 b2 b3 b4 b5 b6 100000: this is exactly the midpoint of the numbers 0.b1 b2 b3 b4 b5 b6 and 0.b1 b2 b3 b4 b5 b6 + 0.000001. The errors can be kept unbiased with a rule such as: "choose the retained bits to the nearest even number". Then 0.b1 b2 b3 b4 b5 0 100000 rounds to 0.b1 b2 b3 b4 b5 0, and 0.b1 b2 b3 b4 b5 1 100000 rounds to 0.b1 b2 b3 b4 b5 1 + 0.000001.
302
Rounding Rounding is evidently the most accurate truncation method.
However, rounding requires an addition operation, and it may require a renormalization if the addition de-normalizes the truncated number. IEEE uses the rounding method. For example, 0.111111 100000 rounds to 0.111111 + 0.000001 = 1.000000, which must be renormalized to 0.100000 x 2^1.
303
CSE243: Introduction to Computer Architecture and Hardware/Software Interface
304
Basic concepts The maximum size of the memory depends on the addressing scheme: a 16-bit computer generates 16-bit addresses and can address up to 2^16 memory locations. The number of locations represents the size of the address space of the computer. Most modern computers are byte-addressable. Memory is designed to store and retrieve data in word-length quantities. The word length of a computer is commonly defined as the number of bits stored and retrieved in one memory operation.
305
Basic concepts (contd..)
[Figure: processor connected to memory by a k-bit address bus (via MAR), an n-bit data bus (via MDR), and control lines (R/W, MFC, etc.); up to 2^k addressable locations, word length = n bits.]
Recall that data transfers between a processor and memory involve two registers, MAR and MDR. If the address bus is k bits wide, then the length of MAR is k bits. If the word length is n bits, then the length of MDR is n bits. Control lines include R/W and MFC. For a Read operation R/W = 1, and for a Write operation R/W = 0.
306
Basic concepts (contd..)
Measures for the speed of a memory: Elapsed time between the initiation of an operation and the completion of an operation is the memory access time. Minimum time between the initiation of two successive memory operations is memory cycle time. In general, the faster a memory system, the costlier it is and the smaller it is. An important design issue is to provide a computer system with as large and fast a memory as possible, within a given cost target. Several techniques to increase the effective size and speed of the memory: Cache memory (to increase the effective speed). Virtual memory (to increase the effective size).
307
Semiconductor RAM memories
A Random Access Memory (RAM) is a memory unit in which any location can be accessed in a fixed amount of time, independent of the location's address. Internal organization of memory chips: each memory cell can hold one bit of information. Memory cells are organized in the form of an array; one row is one memory word. All cells of a row are connected to a common line, known as the "word line", which is driven by the address decoder. Sense/write circuits are connected to the data input/output lines of the memory chip.
308
Semiconductor RAM memories (contd..)
Internal organization of memory chips [Figure: a 16x8 memory chip; an address decoder on lines A0-A3 drives word lines W0-W15; each row of flip-flop (FF) cells forms one word; Sense/Write circuits, controlled by R/W and CS, connect the cell columns to data input/output lines b7-b0.]
309
Semiconductor RAM memories (contd..)
Internal organization of memory chips [Figure: a 1K x 1 chip organized as a 32x32 memory cell array; a 5-bit row address drives a decoder selecting one of word lines W0-W31; a 5-bit column address drives a 32-to-1 output multiplexer and input demultiplexer; 10-bit address in total, with R/W and CS controls and Sense/Write circuitry on the data input/output.]
310
Semiconductor RAM memories (contd..)
Static RAMs (SRAMs): Consist of circuits that are capable of retaining their state as long as power is applied. Volatile memories, because their contents are lost when power is interrupted. Access times of static RAMs are in the range of a few nanoseconds; however, the cost is usually high. Dynamic RAMs (DRAMs): Do not retain their state indefinitely; contents must be periodically refreshed. Contents may be refreshed while they are being accessed for reading.
311
Semiconductor RAM memories (contd..)
Internal organization of a Dynamic RAM memory chip: organized as a 4K x 4K array. The 4096 cells in each row are divided into 512 groups of 8, so each row can store 512 bytes. 12 bits select a row, and 9 bits select a group in a row: a total of 21 address bits. The number of address pins is reduced by multiplexing row and column addresses: first the row address is applied and latched by the RAS signal; then the column address is applied and latched by the CAS signal. The timing of the memory unit is controlled by a specialized unit which generates RAS and CAS. This is asynchronous DRAM. [Figure: row address latch and row decoder select one row of the 4096 x (512 x 8) cell array; column address latch and column decoder, with Sense/Write circuits, select the byte; controls: RAS, CAS, R/W, CS.]
312
Semiconductor RAM memories (contd..)
Recall the operation of the memory: first, all the contents of a row are selected based on the row address, then a particular byte is selected based on the column address. Suppose we want to access consecutive bytes in the selected row. This can be done without having to reselect the row: add a latch at the output of the sense circuits in each row. All the latches are loaded when the row is selected; different column addresses can then be applied to select and place different bytes on the data lines.
313
Semiconductor RAM memories (contd..)
Consecutive sequence of column addresses can be applied under the control signal CAS, without reselecting the row. Allows a block of data to be transferred at a much faster rate than random accesses. A small collection/group of bytes is usually referred to as a block. This transfer capability is referred to as the fast page mode feature.
314
Semiconductor RAM memories (contd..)
Operation is directly synchronized with the processor clock signal: synchronous DRAMs (SDRAMs). The outputs of the sense circuits are connected to a latch. During a Read operation, the contents of the cells in a row are loaded onto the latches. During a refresh operation, the contents of the cells are refreshed without changing the contents of the latches. Data held in the latches that correspond to the selected columns are transferred to the output. For a burst mode of operation, successive columns are selected using a column address counter and the clock; the CAS signal need not be generated externally. [Figure: SDRAM organization; refresh counter, row/column address latches and decoders, cell array, read/write circuits and latches, column address counter, mode register and timing control, data input and output registers; controls: RAS, CAS, R/W, CS, clock.]
315
Semiconductor RAM memories (contd..)
Implement a memory unit of 2M words of 32 bits each, using 512K x 8 static memory chips. Each column consists of 4 chips, and each chip implements one byte position. A chip is selected by setting its chip select control line to 1; the selected chip places its data on the data output lines, while the outputs of the other chips are in the high-impedance state. 21 bits are needed to address a 32-bit word. The high-order 2 bits select the row of chips, by activating one of the four Chip Select signals through a 2-bit decoder; the remaining 19 bits access specific byte locations inside the selected chips. [Figure: array of 512K x 8 memory chips; 21-bit address split into a 2-bit decoder input and a 19-bit internal chip address; 8-bit data input/output per chip on lines D31-24, D23-16, D15-8, D7-0.]
316
Semiconductor RAM memories (contd..)
Large dynamic memory systems can be implemented using DRAM chips in a similar way to static memory systems. Placing large memory systems directly on the motherboard will occupy a large amount of space. Also, this arrangement is inflexible since the memory system cannot be expanded easily. Packaging considerations have led to the development of larger memory units known as SIMMs (Single In-line Memory Modules) and DIMMs (Dual In-line Memory Modules). Memory modules are an assembly of memory chips on a small board that plugs vertically onto a single socket on the motherboard. Occupy less space on the motherboard. Allows for easy expansion by replacement.
317
Semiconductor RAM memories (contd..)
Recall that the rows of a Dynamic RAM need to be accessed periodically in order to be refreshed. In older DRAMs the typical refreshing period was 16 ms; in SDRAMs the typical refreshing period is 64 ms. Consider an SDRAM with 8192 rows, and suppose it takes 4 clock cycles to access each row: a total of 4 x 8192 = 32,768 clock cycles. If the clock rate is 133 MHz, the total refresh time is 32,768 / (133 x 10^6), or about 0.246 x 10^-3 seconds. Refreshing must be done every 64 ms, so the overhead is 0.246 / 64, less than 0.4 percent.
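The overhead arithmetic can be checked directly, using the slide's assumed figures (8192 rows, 4 cycles per row, 133 MHz clock, 64 ms refresh period):

```python
# Refresh-overhead arithmetic for the SDRAM example.
ROWS = 8192
CYCLES_PER_ROW = 4
CLOCK_HZ = 133e6
REFRESH_PERIOD_S = 64e-3

refresh_time = ROWS * CYCLES_PER_ROW / CLOCK_HZ   # time spent refreshing
overhead = refresh_time / REFRESH_PERIOD_S        # fraction of time lost

print(f"{refresh_time * 1e3:.3f} ms")  # ~0.246 ms
print(f"{overhead:.2%}")               # under 0.4%
```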
318
Semiconductor RAM memories (contd..)
Data is transferred between the processor and the memory in units of a single word or a block. The speed and efficiency of data transfer have a large impact on the performance of a computer. Two metrics of performance: latency and bandwidth. Latency: the time taken to transfer a single word of data to or from memory. The definition is clear if the memory operation involves the transfer of a single word. In the case of a block transfer, latency is the time it takes to transfer the first word of data. The time required to transfer the first word in a block is substantially larger than the time required to transfer consecutive words in a block.
319
Semiconductor RAM memories (contd..)
How much time is needed to transfer a single block of data? Since blocks can be variable in size, it is useful to define a performance measure in terms of the number of bits or bytes transferred in one second. This performance measure is referred to as memory bandwidth. The bandwidth of a memory unit depends on: the speed of access to the chip, and how many bits can be accessed in parallel. The bandwidth of data transfer between the processor and the memory unit also depends on the transfer capability of the links that connect them, that is, the speed of the bus.
320
Memory systems Various factors such as cost, speed, power consumption and size of the chip determine how a RAM is chosen for a given application. Static RAMs: Chosen when speed is the primary concern. Circuit implementing the basic cell is highly complex, so cost and size are affected. Used mostly in cache memories. Dynamic RAMs: Predominantly used for implementing computer main memories. High densities available in these chips. Economically viable for implementing large memories.
321
Memory controller Recall that in a dynamic memory chip, to reduce the number of pins, multiplexed addresses are used. Address is divided into two parts: High-order address bits select a row in the array. They are provided first, and latched using RAS signal. Low-order address bits select a column in the row. They are provided later, and latched using CAS signal. However, a processor issues all address bits at the same time. In order to achieve the multiplexing, memory controller circuit is inserted between the processor and memory.
322
Memory controller (contd..)
[Figure: memory controller between processor and memory; the processor supplies the address, R/W, and Request signals and a clock; the controller forwards the row/column address portions and generates RAS, CAS, R/W, and CS for the memory; data lines connect processor and memory directly.]
The memory controller accepts the complete address and the R/W signal from the processor, under control of a Request signal. The controller forwards the row and column address portions to the memory, and issues the RAS and CAS signals. It also sends the R/W and CS signals to the memory. The data lines are connected directly between the processor and the memory.
323
Read-Only Memories (ROMs)
SRAM and SDRAM chips are volatile: they lose their contents when the power is turned off. Many applications need memory devices that retain their contents after the power is turned off. For example, when a computer is turned on, the operating system must be loaded from the disk into the memory. The instructions which load the OS from the disk must be stored so that they are not lost after the power is turned off; we need to store them in a non-volatile memory. Non-volatile memory is read in the same manner as volatile memory, but a separate writing process is needed to place information in it. Since normal operation involves only reading of data, this type of memory is called Read-Only Memory (ROM).
324
Read-Only Memories (contd..)
Read-Only Memory: data are written into a ROM when it is manufactured. Programmable Read-Only Memory (PROM): allows the data to be loaded by the user; the process of inserting the data is irreversible. Storing information specific to a user in a manufactured ROM is expensive, so providing programming capability to the user may be better. Erasable Programmable Read-Only Memory (EPROM): allows stored data to be erased and new data to be loaded; an erasable, reprogrammable ROM. This flexibility is useful during the development phase of digital systems. Erasure requires exposing the ROM to UV light.
325
Read-Only Memories (contd..)
Electrically Erasable Programmable Read-Only Memory (EEPROM): to erase the contents of EPROMs, they have to be exposed to ultraviolet light and physically removed from the circuit. In EEPROMs, the contents can be stored and erased electrically, in place.
326
Flash memory Flash memory takes an approach similar to EEPROM.
Read the contents of a single cell, but write the contents of an entire block of cells. Flash devices have greater density. Higher capacity and low storage cost per bit. Power consumption of flash memory is very low, making it attractive for use in equipment that is battery-driven. Single flash chips are not sufficiently large, so larger memory modules are implemented using flash cards and flash drives.
327
Acronyms
RAM -- Random Access Memory: the time taken to access any arbitrary location in memory is constant (c.f. disks).
ROM -- Read-Only Memory: ROMs can be written only once; thereafter they can only be read. Older uses included storage of bootstrap info.
PROM -- Programmable ROM: a ROM which can be bench programmed.
EPROM -- Erasable PROM: a PROM which can be erased for rewriting.
EEPROM -- Electrically Erasable PROM: a PROM which can be erased electrically.
SRAM -- Static RAM: a RAM chip that retains its contents as long as power is applied, without refreshing.
DRAM -- Dynamic RAM: a RAM chip whose contents need to be periodically refreshed.
SDRAM -- Synchronous DRAM: a DRAM whose memory operations are synchronized with a clock signal.
328
CSE243: Introduction to Computer Architecture and Hardware/Software Interface
329
Memory hierarchy A big challenge in the design of a computer system is to provide a sufficiently large memory, with a reasonable speed at an affordable cost. Static RAM: Very fast, but expensive, because a basic SRAM cell has a complex circuit making it impossible to pack a large number of cells onto a single chip. Dynamic RAM: Simpler basic cell circuit, hence are much less expensive, but significantly slower than SRAMs. Magnetic disks: Storage provided by DRAMs is higher than SRAMs, but is still less than what is necessary. Secondary storage such as magnetic disks provide a large amount of storage, but is much slower than DRAMs.
330
Memory hierarchy (contd..)
All these types of memory units are employed effectively in a computer. SRAM -- Smaller units where speed is critical, namely, cache memories. DRAM -- Large, yet affordable memory, namely, main memory. Magnetic disks -- Huge amount of cost-effective storage. Computer memory can be viewed as a hierarchy.
331
Memory hierarchy (contd..)
[Figure: memory hierarchy; processor registers at the top, then L1 (primary) cache, L2 (secondary) cache, main memory, and magnetic disk (secondary storage); size increases while speed and cost per bit decrease going down the hierarchy.]
Fastest access is to the data held in processor registers; registers are at the top of the memory hierarchy. A relatively small amount of memory can be implemented on the processor chip: this is the processor cache. There are two levels of cache: the Level 1 (L1) cache is on the processor chip, and the Level 2 (L2) cache is between the main memory and the processor. The next level is main memory, implemented as SIMMs: much larger, but much slower than cache memory. The next level is magnetic disks: a huge amount of inexpensive storage. Since the speed of memory access is critical, the idea is to bring the instructions and data that will be used in the near future as close to the processor as possible.
332
Cache memories Processor is much faster than the main memory.
As a result, the processor has to spend much of its time waiting while instructions and data are being fetched from the main memory. Major obstacle towards achieving good performance. Speed of the main memory cannot be increased beyond a certain point. Cache memory is an architectural arrangement which makes the main memory appear faster to the processor than it really is. Cache memory is based on the property of computer programs known as “locality of reference”.
333
Locality of reference Analysis of programs indicates that many instructions in localized areas of a program are executed repeatedly during some period of time, while the others are accessed relatively less frequently. These instructions may be the ones in a loop, a nested loop, or a few procedures calling each other repeatedly. This is called "locality of reference". Temporal locality of reference: a recently executed instruction is likely to be executed again very soon. Spatial locality of reference: instructions with addresses close to a recently executed instruction are likely to be executed soon.
334
Locality of reference (contd..)
Cache memory is based on the concept of locality of reference. If the active segments of a program are placed in a fast cache memory, then the execution time can be reduced. Temporal locality of reference: whenever an instruction or data item is needed for the first time, it should be brought into the cache, where it will hopefully be used again repeatedly. Spatial locality of reference: instead of fetching just one item from the main memory to the cache at a time, several items that have addresses adjacent to the item being fetched may be useful. The term "block" refers to a set of contiguous address locations of some size.
335
Cache memories [Figure: processor connected to the cache, cache connected to the main memory.]
Processor issues a Read request, a block of words is transferred from the main memory to the cache, one word at a time. Subsequent references to the data in this block of words are found in the cache. At any given time, only some blocks in the main memory are held in the cache. Which blocks in the main memory are in the cache is determined by a “mapping function”. When the cache is full, and a block of words needs to be transferred from the main memory, some block of words in the cache must be replaced. This is determined by a “replacement algorithm”.
336
Cache memories (contd..)
Existence of a cache is transparent to the processor. The processor issues Read and Write requests in the same manner. If the data is in the cache it is called a Read or Write hit. Read hit: - The data is obtained from the cache. Write hit: - Cache is a replica of the contents of the main memory. - Contents of the cache and the main memory may be updated simultaneously. This is the write-through protocol. - Update the contents of the cache, and mark it as updated by setting a bit known as the dirty bit or modified bit. The contents of the main memory are updated when this block is replaced. This is write-back or copy-back protocol.
337
Cache memories (contd..)
If the data is not present in the cache, then a Read miss or Write miss occurs. Read miss: - The block of words containing the requested word is transferred from the memory. - After the block is transferred, the desired word is forwarded to the processor. - The desired word may also be forwarded to the processor as soon as it arrives, without waiting for the entire block to be transferred. This is called load-through or early-restart. Write miss: - If the write-through protocol is used, then the contents of the main memory are updated directly. - If the write-back protocol is used, the block containing the addressed word is first brought into the cache, and the desired word is then overwritten with the new information.
338
Cache memories (contd..)
A bit called the "valid bit" is provided for each block. If the block contains valid data, then the bit is set to 1, else it is 0. Valid bits are set to 0 when the power is first turned on. When a block is loaded into the cache for the first time, its valid bit is set to 1. Data transfers between main memory and disk occur directly, bypassing the cache. When the data on a disk changes, the main memory block is also updated; however, if that data is also resident in the cache, its valid bit is set to 0. What happens if the data on the disk and in main memory changes and the write-back protocol is being used? In this case, the data in the cache may also have changed, as indicated by the dirty bit, so the copies of the data in the cache and the main memory are different. This is called the cache coherence problem. One option is to force a write-back before the main memory is updated from the disk.
339
Mapping functions Mapping functions determine how memory blocks are placed in the cache. A simple processor example: Cache consisting of 128 blocks of 16 words each. Total size of cache is 2048 (2K) words. Main memory is addressable by a 16-bit address. Main memory has 64K words. Main memory has 4K blocks of 16 words each. Consecutive addresses refer to consecutive words. Three mapping functions: Direct mapping Associative mapping Set-associative mapping.
340
Direct mapping Block j of the main memory maps to block j modulo 128 of the cache: blocks 0 and 128 map to cache block 0, block 129 maps to cache block 1, and so on. More than one memory block is mapped onto the same position in the cache, which may lead to contention for cache blocks even if the cache is not full. Contention is resolved by allowing the new block to replace the old block, leading to a trivial replacement algorithm. The memory address is divided into three fields: - The low-order 4 bits determine one of the 16 words in a block. - When a new block is brought into the cache, the next 7 bits determine which cache block this new block is placed in. - The high-order 5 bits are tag bits, which determine which of the possible 32 memory blocks is currently present in that cache block. Simple to implement, but not very flexible. [Figure: main memory blocks 0-4095 mapping onto 128 tagged cache blocks; 16-bit main memory address split as Tag (5) | Block (7) | Word (4).]
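The field split for this example can be sketched in a few lines; the function name is mine:

```python
# Address split for the direct-mapped example: 16-bit address, 16 words per
# block, 128 cache blocks -> 4 word bits, 7 block bits, 5 tag bits.
def split_address(addr):
    word = addr & 0xF              # low-order 4 bits: word within the block
    block = (addr >> 4) & 0x7F     # next 7 bits: cache block = mem block mod 128
    tag = addr >> 11               # high-order 5 bits: tag
    return tag, block, word

# Memory block 129 starts at address 129 * 16 and maps to cache block 1:
print(split_address(129 * 16))  # (1, 1, 0)
```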
341
Associative mapping A main memory block can be placed into any cache position. The memory address is divided into two fields: - The low-order 4 bits identify the word within a block. - The high-order 12 bits, or tag bits, identify a memory block when it is resident in the cache. Flexible, and uses cache space efficiently. Replacement algorithms can be used to replace an existing block in the cache when the cache is full. The cost is higher than for a direct-mapped cache because of the need to search the tags of all 128 cache blocks to determine whether a given block is in the cache. [Figure: main memory blocks 0-4095 mapping onto any of the 128 tagged cache blocks; 16-bit main memory address split as Tag (12) | Word (4).]
342
Set-Associative mapping
Blocks of the cache are grouped into sets, and the mapping function allows a block of the main memory to reside in any block of a specific set. Divide the cache into 64 sets, with two blocks per set. Memory blocks 0, 64, 128, etc. map to set 0, and they can occupy either of the two positions within the set. The memory address is divided into three fields: - The low-order 4 bits identify the word within a block. - A 6-bit field determines the set number. - The high-order 6 tag bits are compared to the tag fields of the two blocks in the set. Set-associative mapping is a combination of direct and associative mapping. The number of blocks per set is a design parameter: - One extreme is to have all the blocks in one set, requiring no set bits (fully associative mapping). - The other extreme is to have one block per set, which is the same as direct mapping. [Figure: main memory blocks 0-4095 mapping onto 64 sets of two tagged cache blocks each; 16-bit main memory address split as Tag (6) | Set (6) | Word (4).]
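The set-associative split can be sketched the same way; the function name is mine:

```python
# Address split for the set-associative example: 64 sets x 2 blocks ->
# 4 word bits, 6 set bits, 6 tag bits.
def split_set_assoc(addr):
    word = addr & 0xF             # word within the block
    set_no = (addr >> 4) & 0x3F   # set number = memory block mod 64
    tag = addr >> 10              # compared against both blocks of the set
    return tag, set_no, word

# Memory blocks 0, 64 and 128 all contend for set 0:
print([split_set_assoc(b * 16)[1] for b in (0, 64, 128)])  # [0, 0, 0]
```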
343
Replacement algorithms
In a direct-mapped cache, the position that each memory block occupies in the cache is fixed; as a result, the replacement strategy is trivial. Associative and set-associative mapping provide some flexibility in deciding which memory block occupies which cache block. When a new block is to be transferred to the cache, and all the positions it may occupy are full, which block in the cache should be replaced? Locality of reference suggests that it may be reasonable to replace the block that has gone the longest time without being referenced. This block is called the Least Recently Used (LRU) block, and the replacement strategy is called the LRU replacement algorithm. The LRU algorithm has been used extensively. It provides poor performance in some cases; its performance may be improved by introducing a small amount of randomness into the choice of block to replace. Other replacement algorithms include removing the "oldest" block; however, they disregard the locality of reference principle and do not perform as well.
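LRU bookkeeping for one cache set can be sketched with Python's ordered dictionary; a minimal model, not hardware LRU logic, and the class name is mine:

```python
from collections import OrderedDict

class LRUSet:
    """Tracks the blocks resident in one cache set, evicting the LRU block."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> None; least recently used first

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)   # hit: mark as most recently used
            return None
        victim = None
        if len(self.blocks) == self.ways:
            victim, _ = self.blocks.popitem(last=False)  # evict the LRU tag
        self.blocks[tag] = None
        return victim   # the replaced tag, or None if nothing was evicted

# Two-way set: accessing 1 again keeps it resident, so 3 evicts 2, not 1:
s = LRUSet(2)
print([s.access(t) for t in (1, 2, 1, 3)])  # [None, None, None, 2]
```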
344
Performance considerations
A key design objective is to achieve the best possible performance at the lowest possible cost. Price/performance ratio is a common measure. Performance of a processor depends on: How fast machine instructions can be brought into the processor for execution. How fast the instructions can be executed. Memory hierarchy described earlier was created to increase the speed and size of the memory at an affordable cost. Data need to be transferred between various units of this hierarchy as well. Speed and efficiency of data transfer between these various memory units also impacts the performance.
345
Interleaving Main memory of a computer is structured as a collection of modules. Each module has its own address buffer register (ABR) and data buffer register (DBR). Memory access operations can proceed in more than one module increasing the rate of transfer of words to and from the main memory. How individual addresses are distributed over the modules is critical in determining how many modules can be kept busy simultaneously.
346
Interleaving (contd..) Consecutive words are placed in a module.
[Figure: n-module memory; each module has its own ABR and DBR. The memory address is split into a k-bit module field (high-order) and an m-bit address-in-module field (low-order).]
Consecutive words are placed in a module. High-order k bits of a memory address determine the module. Low-order m bits of a memory address determine the word within a module. When a block of words is transferred from main memory to cache, only one module is busy at a time. Other modules may be involved in data transfers between main memory and disk.
347
Interleaving (contd..)
[Figure: 2^k-module memory; each module has its own ABR and DBR. The memory address is split into an m-bit address-in-module field (high-order) and a k-bit module field (low-order).]
Consecutive words are located in consecutive modules. Low-order k bits of a memory address select a module. High-order m bits select a word location within that module. Since consecutive addresses are located in consecutive modules, several memory modules can be kept busy at the same time while transferring a block of data. This gives faster access and higher memory utilization, and is called "memory interleaving".
348
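The interleaved address layout can be sketched with a small helper. The function name and the 4-module configuration are illustrative choices, not from the text:

```python
def interleaved(addr, k):
    """Memory interleaving: the low-order k bits select the module;
    the remaining high-order bits select the word within that module."""
    module = addr & ((1 << k) - 1)   # low-order k bits
    word = addr >> k                 # high-order m bits
    return module, word

# With 4 modules (k = 2), consecutive addresses 0..7 cycle through
# modules 0, 1, 2, 3, 0, 1, 2, 3 — so a block transfer keeps all
# modules busy at the same time.
mods = [interleaved(a, 2)[0] for a in range(8)]
```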
Performance enhancements
Write buffer Write-through: Each write operation involves writing to the main memory. If the processor has to wait for the write operation to complete, it is slowed down; however, the processor does not depend on the results of the write operation. A write buffer can be included for temporary storage of write requests: the processor places each write request into the buffer and continues execution. If a subsequent Read request references data which are still in the write buffer, the data are read from the write buffer. Write-back: A block is written back to the main memory when it is replaced. If the processor had to wait for this write to complete before reading the new block, it would be slowed down; instead, a fast write buffer can hold the block to be written, and the new block can be read first.
349
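As a rough illustration of the write-buffer idea, here is a toy model (the class and method names are invented for illustration) in which the processor's writes are queued and a subsequent read is satisfied from the buffer if the data have not yet reached memory:

```python
class WriteBuffer:
    """Toy write-through buffer: writes are queued so the processor can
    continue; a read checks the buffer before going to main memory."""
    def __init__(self):
        self.pending = {}   # address -> data not yet written to memory
        self.memory = {}

    def write(self, addr, data):
        self.pending[addr] = data      # processor continues immediately

    def read(self, addr):
        if addr in self.pending:       # forward data still in the buffer
            return self.pending[addr]
        return self.memory.get(addr, 0)

    def drain_one(self):
        if self.pending:               # memory completes one queued write
            addr, data = self.pending.popitem()
            self.memory[addr] = data

wb = WriteBuffer()
wb.write(0x10, 42)
v = wb.read(0x10)   # served from the buffer before memory is updated
```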
Performance enhancements
Prefetching New data are brought into the cache when they are first needed, so the processor has to wait until the data transfer is complete. Instead, data can be prefetched into the cache before they are actually needed, that is, before a Read miss occurs. Prefetching should occur (hopefully) while the processor is busy executing instructions that do not result in a read miss. Prefetching can be accomplished through software, by including a special prefetch instruction in the machine language of the processor; the inclusion of prefetch instructions increases the length of programs. Prefetching can also be accomplished using hardware: circuitry that attempts to discover patterns in memory references and then prefetches according to these patterns.
350
Performance enhancements
Lockup-Free Cache A prefetching scheme does not work well if it stops other accesses to the cache until the prefetch is completed; a cache of this type is said to be "locked" while it services a miss. A cache structure that supports multiple outstanding misses is called a lockup-free cache. Since more than one miss can be outstanding at a time, a lockup-free cache must include circuits that keep track of all the outstanding misses; special registers may hold the necessary information about these misses.
351
CSE243: Introduction to Computer Architecture and Hardware/Software Interface
352
Virtual memories Recall that an important challenge in the design of a computer system is to provide a large, fast memory system at an affordable cost. Architectural solutions to increase the effective speed and size of the memory system. Cache memories were developed to increase the effective speed of the memory system. Virtual memory is an architectural solution to increase the effective size of the memory system.
353
Virtual memories (contd..)
Recall that the addressable memory space depends on the number of address bits in a computer. For example, if a computer issues 32-bit addresses, the addressable memory space is 4G bytes. Physical main memory in a computer is generally not as large as the entire possible addressable space. Physical memory typically ranges from a few hundred megabytes to 1G bytes. Large programs that cannot fit completely into the main memory have their parts stored on secondary storage devices such as magnetic disks. Pieces of programs must be transferred to the main memory from secondary storage before they can be executed.
354
Virtual memories (contd..)
When a new piece of a program is to be transferred to the main memory, and the main memory is full, then some other piece in the main memory must be replaced. Recall this is very similar to what we studied in case of cache memories. Operating system automatically transfers data between the main memory and secondary storage. Application programmer need not be concerned with this transfer. Also, application programmer does not need to be aware of the limitations imposed by the available physical memory.
355
Virtual memories (contd..)
Techniques that automatically move program and data between main memory and secondary storage when they are required for execution are called virtual-memory techniques. Programs and processors reference an instruction or data independent of the size of the main memory. Processor issues binary addresses for instructions and data. These binary addresses are called logical or virtual addresses. Virtual addresses are translated into physical addresses by a combination of hardware and software subsystems. If virtual address refers to a part of the program that is currently in the main memory, it is accessed immediately. If the address refers to a part of the program that is not currently in the main memory, it is first transferred to the main memory before it can be used.
356
Virtual memory organization
The Memory Management Unit (MMU) translates virtual addresses into physical addresses. If the desired data or instructions are in the main memory, they are fetched as described previously. If the desired data or instructions are not in the main memory, they must be transferred from secondary storage to the main memory: the MMU causes the operating system to bring the data from the secondary storage into the main memory.
[Figure: the processor issues a virtual address to the MMU; the MMU supplies the physical address to the cache and main memory; data move between disk storage and main memory by DMA transfer.]
357
Address translation Assume that program and data are composed of fixed-length units called pages. A page consists of a block of words that occupy contiguous locations in the main memory. Page is a basic unit of information that is transferred between secondary storage and main memory. Size of a page commonly ranges from 2K to 16K bytes. Pages should not be too small, because the access time of a secondary storage device is much larger than the main memory. Pages should not be too large, else a large portion of the page may not be used, and it will occupy valuable space in the main memory.
358
Address translation (contd..)
Concepts of virtual memory are similar to the concepts of cache memory. Cache memory: introduced to bridge the speed gap between the processor and the main memory; implemented in hardware. Virtual memory: introduced to bridge the size gap between the main memory and secondary storage; implemented in part by software.
359
Address translation (contd..)
Each virtual or logical address generated by the processor is interpreted as a virtual page number (high-order bits) plus an offset (low-order bits) that specifies the location of a particular byte within that page. Information about the main memory location of each page is kept in the page table: the main memory address where the page is stored, and the current status of the page. The area of the main memory that can hold a page is called a page frame. The starting address of the page table is kept in a page table base register.
360
Address translation (contd..)
Virtual page number generated by the processor is added to the contents of the page table base register. This provides the address of the corresponding entry in the page table. The contents of this location in the page table give the starting address of the page if the page is currently in the main memory.
361
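The translation steps above can be sketched as follows. The page table is modelled as a Python dict purely for illustration, and the 4K-byte page size is an assumed value within the 2K-16K range mentioned earlier:

```python
PAGE_SIZE = 4096            # assumed 4K-byte pages for illustration

def translate(vaddr, page_table):
    """Split a virtual address into (virtual page number, offset) and
    look up the page frame in a page table modelled as a dict."""
    vpn = vaddr // PAGE_SIZE         # high-order bits: virtual page number
    offset = vaddr % PAGE_SIZE       # low-order bits: byte within the page
    frame = page_table[vpn]          # a KeyError here would model a page fault
    return frame * PAGE_SIZE + offset

# Virtual page 2 resides in page frame 5:
pt = {2: 5}
pa = translate(2 * PAGE_SIZE + 100, pt)
```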
Address translation (contd..)
PTBR holds the address of the page table. Virtual address from processor Page table base register Page table address Virtual page number Offset Virtual address is interpreted as page number and offset. + PTBR + virtual page number provide the entry of the page in the page table. PAGE TABLE This entry has the starting location of the page. Page table holds information about each page. This includes the starting address of the page in the main memory. Control Page frame bits in memory Page frame Offset Physical address in main memory
362
Address translation (contd..)
Page table entry for a page also includes some control bits which describe the status of the page while it is in the main memory. One bit indicates the validity of the page. Indicates whether the page is actually loaded into the main memory. Allows the operating system to invalidate the page without actually removing it. One bit indicates whether the page has been modified during its residency in the main memory. This bit determines whether the page should be written back to the disk when it is removed from the main memory. Similar to the dirty or modified bit in case of cache memory.
363
Address translation (contd..)
Other control bits for various other types of restrictions that may be imposed. For example, a program may only have read permission for a page, but not write or modify permissions.
364
Address translation (contd..)
Where should the page table be located? Recall that the page table is used by the MMU for every read and write access to the memory, so the ideal location for the page table would be within the MMU. However, the page table is quite large, and since the MMU is implemented as part of the processor chip, it is impossible to include a complete page table on the chip. The page table is therefore kept in the main memory, and a copy of a small portion of it can be accommodated within the MMU. This portion consists of the page table entries that correspond to the most recently accessed pages.
365
Address translation (contd..)
A small cache called the Translation Lookaside Buffer (TLB) is included in the MMU. The TLB holds the page table entries of the most recently accessed pages. Recall that cache memory holds the most recently accessed blocks from the main memory; the operation of the TLB and the page table in the main memory is similar to the operation of the cache and main memory. The page table entry for a page includes: the address of the page frame where the page resides in the main memory, and some control bits. In addition to the above, the TLB must hold the virtual page number for each page.
366
Address translation (contd..)
[Figure: associative-mapped TLB. Each TLB entry holds a virtual page number, control bits, and a page frame number. The virtual page number from the processor's virtual address is compared against all TLB entries; on a hit, page frame + offset gives the physical address in main memory.]
The high-order bits of the virtual address generated by the processor select the virtual page. These bits are compared to the virtual page numbers in the TLB. If there is a match, a hit occurs and the corresponding page frame address is read. If there is no match, a miss occurs and the page table in the main memory must be consulted. Set-associative mapped TLBs are found in commercial processors.
367
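The TLB-then-page-table lookup sequence can be modelled with a short sketch. Both structures are represented as dicts here purely for illustration; a real TLB is an associative hardware memory inside the MMU:

```python
def tlb_lookup(tlb, page_table, vpn):
    """Look up a virtual page number in the TLB first; on a miss,
    consult the page table and cache the entry in the TLB."""
    if vpn in tlb:
        return tlb[vpn], True        # hit: page frame found in the TLB
    frame = page_table[vpn]          # miss: consult the page table
    tlb[vpn] = frame                 # bring the entry into the TLB
    return frame, False

tlb, pt = {}, {7: 3}
f1, hit1 = tlb_lookup(tlb, pt, 7)   # first access misses
f2, hit2 = tlb_lookup(tlb, pt, 7)   # second access hits in the TLB
```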
Address translation (contd..)
How to keep the entries of the TLB coherent with the contents of the page table in the main memory? Operating system may change the contents of the page table in the main memory. Simultaneously it must also invalidate the corresponding entries in the TLB. A control bit is provided in the TLB to invalidate an entry. If an entry is invalidated, then the TLB gets the information for that entry from the page table. Follows the same process that it would follow if the entry is not found in the TLB or if a “miss” occurs.
368
Address translation (contd..)
What happens if a program generates an access to a page that is not in the main memory? In this case, a page fault is said to occur, and the whole page must be brought into the main memory from the disk before execution can proceed. When the MMU detects a page fault, the following actions occur: The MMU asks the operating system to intervene by raising an exception. Processing of the active task which caused the page fault is interrupted, and control is transferred to the operating system. The operating system copies the requested page from secondary storage to the main memory. Once the page is copied, control is returned to the task which was interrupted.
369
Address translation (contd..)
Servicing of a page fault requires transferring the requested page from secondary storage to the main memory. This transfer may incur a long delay. While the page is being transferred the operating system may: Suspend the execution of the task that caused the page fault. Begin execution of another task whose pages are in the main memory. Enables efficient use of the processor.
370
Address translation (contd..)
How to ensure that the interrupted task can continue correctly when it resumes execution? There are two possibilities: Execution of the interrupted task must continue from the point where it was interrupted. The instruction must be restarted. Which specific option is followed depends on the design of the processor.
371
Address translation (contd..)
When a new page is to be brought into the main memory from secondary storage and the main memory is full, some page in the main memory must be replaced with the new page. How to choose which page to replace? This is similar to the replacement that occurs when the cache is full. The principle of locality of reference can also be applied here, so a replacement strategy similar to LRU can be used. Since the main memory is relatively large compared to the cache, a relatively large amount of programs and data can be held in the main memory. This minimizes the frequency of transfers between secondary storage and main memory.
372
Address translation (contd..)
A page may be modified during its residency in the main memory. When should the page be written back to secondary storage? Recall that we encountered a similar problem in the context of cache and main memory: the write-through and write-back protocols. A write-through protocol cannot be used here, since it would incur a long delay each time a small amount of data is written to the disk; a write-back approach is used instead.
373
Memory management The operating system is concerned with transferring programs and data between secondary storage and main memory; it needs memory-management routines in addition to its other routines. Operating system routines are assembled into a virtual address space called the system space. The system space is separate from the space in which user application programs reside, called the user space. The virtual address space is divided into one system space plus several user spaces.
374
Memory management (contd..)
Recall that the Memory Management Unit (MMU) translates logical or virtual addresses into physical addresses. MMU uses the contents of the page table base register to determine the address of the page table to be used in the translation. Changing the contents of the page table base register can enable us to use a different page table, and switch from one space to another. At any given time, the page table base register can point to one page table. Thus, only one page table can be used in the translation process at a given time. Pages belonging to only one space are accessible at any given time.
375
Memory management (contd..)
When multiple, independent user programs coexist in the main memory, how to ensure that one program does not modify/destroy the contents of the other? Processor usually has two states of operation: Supervisor state. User state. Supervisor state: Operating system routines are executed. User state: User programs are executed. Certain privileged instructions cannot be executed in user state. These privileged instructions include the ones which change page table base register. Prevents one user from accessing the space of other users.
376
Example #1: Effect of Interleaving
Consider a cache which has 8 words per block. On a read miss, the block that contains the desired word must be copied from the memory into the cache. Assume that the hardware has the following properties: it takes 1 clock cycle to send an address to the main memory; the first word is accessed in 8 clock cycles, and each subsequent word is accessed in 4 clock cycles; one clock cycle is needed to send a word to the cache. How many clock cycles does it take to send the block of words to the cache? The total time taken is 1 + 8 + (7 x 4) + 1 = 38 clock cycles.
378
Example #1: Effect of Interleaving
If the memory is constructed as four interleaved modules, then when the starting address of the block arrives at the memory, all four modules begin accessing the required data using the high-order bits of the address. After 8 clock cycles, each module has one word of data in its DBR. These words are transferred to the cache one word at a time during the next 4 clock cycles, during which time the next word in each module is accessed. It then takes another 4 clock cycles to transfer these words to the cache. Therefore the total time taken is 1 + 8 + 4 + 4 = 17 clock cycles. The speedup obtained by interleaving is 38/17 = 2.2.
379
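The two cycle counts worked out in this example can be captured in a pair of helper functions (the function names and parameterization are illustrative, chosen to match the example's timing assumptions):

```python
def block_transfer_cycles(words, first, rest, send_addr=1, to_cache=1):
    """Non-interleaved memory: one cycle to send the address, `first`
    cycles for the first word, `rest` for each later word, and one
    cycle to move the final word into the cache."""
    return send_addr + first + (words - 1) * rest + to_cache

def interleaved_cycles(modules, first, per_word=1, groups=2):
    """Four-way interleaved memory for an 8-word block: 1 address cycle,
    `first` cycles while all modules fetch in parallel, then each group
    of `modules` words moves to the cache at `per_word` cycle per word
    (the next group's access overlaps with this transfer)."""
    return 1 + first + groups * modules * per_word

plain = block_transfer_cycles(8, 8, 4)   # 1 + 8 + 7*4 + 1 = 38
fast = interleaved_cycles(4, 8)          # 1 + 8 + 4 + 4 = 17
```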
Example #2: Effect of cache on processor chip
Consider the impact of the cache on the overall performance of the computer. Let h be the hit rate, M the miss penalty (that is, the time to access information in the main memory), and C the time to access information in the cache. Then the average access time experienced by the processor is given by t_avg = hC + (1 - h)M (refer to page 332 of the textbook). Consider the following example. If the computer has no cache, then every memory read access takes 10 clock cycles. For a computer which has a cache that holds 8-word blocks and an interleaved main memory, it takes 17 clock cycles to transfer a block from the main memory to the cache. Assume that 30% of the instructions require a memory access, so there are 130 memory accesses for every 100 instructions executed, and that the hit rates in the cache are 0.95 for instructions and 0.9 for data. Then the improvement in performance is: (130 x 10) / (100(0.95 x 1 + 0.05 x 17) + 30(0.9 x 1 + 0.1 x 17)) = 5.04.
380
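The average-access-time formula t_avg = hC + (1 - h)M and the example's improvement figure can be checked numerically with a short sketch:

```python
def avg_access(h, C, M):
    """Average access time: a fraction h of accesses hit at cost C,
    and the remaining (1 - h) miss at cost M."""
    return h * C + (1 - h) * M

# Example figures: 130 accesses per 100 instructions (100 instruction
# fetches + 30 data accesses), 10 cycles per access without a cache,
# 17-cycle miss penalty, hit rates 0.95 (instructions) and 0.9 (data).
without_cache = 130 * 10
with_cache = 100 * avg_access(0.95, 1, 17) + 30 * avg_access(0.9, 1, 17)
improvement = without_cache / with_cache   # roughly 5.04
```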
Example #3: Effect of L1 & L2 cache.
Consider the impact of the L1 and L2 caches on the overall performance of the processor. Let h1 be the hit rate in cache L1, h2 the hit rate in cache L2, C1 the time to access information in the L1 cache, C2 the time to access information in the L2 cache, and M the time to access information in the main memory. Then the average access time of the processor is given by t_avg = h1*C1 + (1 - h1)(h2*C2 + (1 - h2)*M) (refer to page 335 of the textbook).
381
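A sketch of the two-level formula t_avg = h1*C1 + (1 - h1)(h2*C2 + (1 - h2)*M); the numbers plugged in below are hypothetical, chosen only to exercise the formula (the slide does not supply values for this example):

```python
def avg_access_two_level(h1, h2, C1, C2, M):
    """Average access time with L1 and L2 caches: an L1 miss goes to
    L2, and an L2 miss goes to main memory."""
    return h1 * C1 + (1 - h1) * (h2 * C2 + (1 - h2) * M)

# Hypothetical figures: 1-cycle L1, 10-cycle L2, 100-cycle memory.
t = avg_access_two_level(h1=0.95, h2=0.9, C1=1, C2=10, M=100)
```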
Example #4: Set-associative cache
A computer system has a main memory of 64K 16-bit words. It has a cache of 128 blocks with 16 words per block, organized in a block-set-associative manner with 2 blocks per set. (a) Calculate the number of bits in each of the TAG, SET and WORD fields of the main memory address format. (b) Assume that the cache is initially empty. Suppose that the processor fetches 2080 words from locations 0, 1, 2, ..., 2079, in that order. It then repeats this fetch sequence nine more times. If the cache is 10 times faster than the main memory, estimate the improvement factor resulting from the use of the cache. Assume that the LRU algorithm is used for block replacement. (a) The main memory address is 16 bits. The number of bits in the WORD field is 4 (16 words per block). The number of bits in the SET field is 6 (128/2 = 64 sets). The number of bits in the TAG field is 16 - (6 + 4) = 6.
382
Example #4: Set-associative cache
Words 0, 1, 2, ..., 2079 occupy blocks 0 to 129 in the main memory. After blocks 0, 1, ..., 127 have been read from the main memory into the cache on the first pass, the cache is full. Because the replacement algorithm is LRU, main memory blocks that map to the first two of the 64 cache sets are always overwritten before they can be used on a successive pass. In particular, main memory blocks 0, 64 and 128 continually displace one another in competing for the 2 block positions in cache set 0; similarly, main memory blocks 1, 65 and 129 continually displace one another in competing for the 2 block positions in cache set 1. Main memory blocks that occupy the last 62 sets are fetched once on the first pass and remain in the cache for the next 9 passes. On the first pass, all 130 blocks must be fetched from the main memory. On each of the 9 later passes, the blocks in the last 62 sets of the cache (62 x 2 = 124 blocks) are found in the cache; the remaining 6 blocks (130 - 124 = 6) must be fetched from the main memory. Improvement factor = Time without cache / Time with cache = (10 x 130 x 10t) / (1 x 130 x 11t + 9(124 x 1t + 6 x 11t)) = 4.14.
383
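The field widths from part (a) and the improvement factor from part (b) can be verified numerically:

```python
# Main memory: 64K words -> 16-bit address.  Cache: 128 blocks of
# 16 words each, 2-way set associative -> 128 / 2 = 64 sets.
WORD_BITS = 4                      # 16 words per block
SET_BITS = 6                       # 64 sets
TAG_BITS = 16 - SET_BITS - WORD_BITS

# Improvement factor from the example (t = cache access time per block;
# main memory is 10x slower): the first pass fetches all 130 blocks at
# 11t each; each of the 9 later passes finds 124 blocks in the cache
# (1t each) and refetches 6 blocks (11t each).
t = 1.0
time_without = 10 * 130 * 10 * t
time_with = 1 * 130 * 11 * t + 9 * (124 * 1 * t + 6 * 11 * t)
factor = time_without / time_with          # roughly 4.14
```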
Basic Input/Output Operations
We have seen instructions to: Transfer information between the processor and the memory. Perform arithmetic and logic operations. Control program sequencing and flow. Input/Output operations, which transfer data between the processor or memory and the outside world, are also essential. In general, the rate of transfer from any input device to the processor, or from the processor to any output device, is likely to be slower than the speed of the processor. This difference in speed makes it necessary to create mechanisms to synchronize the data transfer between them.
385
Basic Input/Output operations (contd..)
Let us consider a simple task of reading a character from a keyboard and displaying that character on a display screen. A simple way of performing the task is called program-controlled I/O. There are two separate blocks of instructions in the I/O program that perform this task: One block of instructions transfers the character into the processor. Another block of instructions causes the character to be displayed.
386
Basic Input/Output operations (contd..)
[Figure: keyboard and display connected to the processor over a bus; the keyboard interface contains the DATAIN buffer register and the SIN status flag, and the display interface contains DATAOUT and SOUT.]
Input: When a key is struck on the keyboard, an 8-bit character code is stored in the buffer register DATAIN. A status control flag SIN is set to 1 to indicate that a valid character is in DATAIN. A program monitors SIN, and when SIN is set to 1, it reads the contents of DATAIN. When the character is transferred to the processor, SIN is automatically cleared. The initial state of SIN is 0.
387
Basic Input/Output operations (contd..)
Output: When SOUT is equal to 1, the display is ready to receive a character. A program monitors SOUT, and when SOUT is set to 1, the processor transfers a character code to the buffer register DATAOUT. The transfer of a character code to DATAOUT clears SOUT to 0. The initial state of SOUT is 1.
388
Basic Input/Output operations (contd..)
The buffer registers DATAIN and DATAOUT and the status flags SIN and SOUT are part of circuitry known as the device interface. To perform I/O transfers, we need machine instructions to: Monitor the status of the status registers. Transfer data between the I/O devices and the processor. These instructions have a format similar to the instructions used for moving data between the processor and the memory. How does the processor address the I/O devices? Some memory address values may be used to refer to peripheral device buffer registers such as DATAIN and DATAOUT. This arrangement is called Memory-Mapped I/O.
389
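The keyboard-input side of program-controlled I/O can be sketched as a busy-wait loop. This is a toy software model of the SIN/DATAIN interface; real code would read memory-mapped register addresses rather than Python attributes, and the names here are illustrative:

```python
class Device:
    """Toy device interface with a status flag and a data buffer,
    mimicking SIN/DATAIN from the keyboard example."""
    def __init__(self):
        self.sin = 0        # 0 means no valid character in DATAIN
        self.datain = 0

    def key_struck(self, code):
        self.datain = code
        self.sin = 1        # a valid character is now in DATAIN

def read_char(dev, max_polls=1000):
    """Program-controlled I/O: busy-wait on the status flag, then read."""
    for _ in range(max_polls):
        if dev.sin == 1:    # poll SIN until a character arrives
            dev.sin = 0     # reading DATAIN clears SIN
            return dev.datain
    return None             # gave up; nothing ever arrived

kbd = Device()
kbd.key_struck(ord('A'))
c = read_char(kbd)
```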
Accessing I/O devices
[Figure: processor, memory, and I/O devices 1 through n connected to a common bus.]
Multiple I/O devices may be connected to the processor and the memory via a bus. The bus consists of three sets of lines to carry address, data and control signals. Each I/O device is assigned a unique address. To access an I/O device, the processor places the device's address on the address lines; the device recognizes its address and responds to the control signals.
390
Accessing I/O devices (contd..)
I/O devices and the memory may share the same address space: Memory-mapped I/O. Any machine instruction that can access memory can be used to transfer data to or from an I/O device. Simpler software. I/O devices and the memory may have different address spaces: Special instructions to transfer data to and from I/O devices. I/O devices may have to deal with fewer address lines. I/O address lines need not be physically separate from memory address lines. In fact, address lines may be shared between I/O devices and memory, with a control signal to indicate whether it is a memory address or an I/O address.
391
Accessing I/O devices (contd..)
[Figure: input device attached to the bus through an I/O interface circuit consisting of an address decoder, data and status registers, and control circuits, connected to the address, data and control lines.]
An I/O device is connected to the bus using an I/O interface circuit, which has an address decoder, control circuits, and data and status registers. The address decoder decodes the address placed on the address lines, enabling the device to recognize its address. The data register holds the data being transferred to or from the processor. The status register holds information necessary for the operation of the I/O device. The data and status registers are connected to the data lines and have unique addresses. The I/O interface circuit coordinates I/O transfers.
392
Accessing I/O devices (contd..)
Recall that the rate of transfer to and from I/O devices is slower than the speed of the processor. This creates the need for mechanisms to synchronize data transfers between them. Program-controlled I/O: the processor repeatedly monitors a status flag to achieve the necessary synchronization; that is, the processor polls the I/O device. Two other mechanisms are used for synchronizing data transfers between the processor and I/O devices: Interrupts. Direct Memory Access.
393
Interrupts In program-controlled I/O, when the processor continuously monitors the status of the device, it does not perform any useful tasks. An alternate approach would be for the I/O device to alert the processor when it becomes ready. Do so by sending a hardware signal called an interrupt to the processor. At least one of the bus control lines, called an interrupt-request line is dedicated for this purpose. Processor can perform other useful tasks while it is waiting for the device to be ready.
394
Interrupts (contd..)
[Figure: program 1 executing at address i when an interrupt occurs; control transfers to the interrupt-service routine and later returns to address i + 1.]
The processor is executing the instruction located at address i when an interrupt occurs. The routine executed in response to an interrupt request is called the interrupt-service routine. When an interrupt occurs, control must be transferred to the interrupt-service routine. But before transferring control, the current contents of the PC (i + 1) must be saved in a known location. This will enable the return-from-interrupt instruction to resume execution at i + 1. The return address, i.e. the contents of the PC, is usually stored on the processor stack.
395
Interrupts (contd..) Treatment of an interrupt-service routine is very similar to that of a subroutine. However there are significant differences: A subroutine performs a task that is required by the calling program. Interrupt-service routine may not have anything in common with the program it interrupts. Interrupt-service routine and the program that it interrupts may belong to different users. As a result, before branching to the interrupt-service routine, not only the PC, but other information such as condition code flags, and processor registers used by both the interrupted program and the interrupt service routine must be stored. This will enable the interrupted program to resume execution upon return from interrupt service routine.
396
Interrupts (contd..) Saving and restoring information can be done automatically by the processor or explicitly by program instructions. Saving and restoring registers involves memory transfers: Increases the total execution time. Increases the delay between the time an interrupt request is received, and the start of execution of the interrupt-service routine. This delay is called interrupt latency. In order to reduce the interrupt latency, most processors save only the minimal amount of information: This minimal amount of information includes Program Counter and processor status registers. Any additional information that must be saved, must be saved explicitly by the program instructions at the beginning of the interrupt service routine.
397
Interrupts (contd..) When a processor receives an interrupt-request, it must branch to the interrupt service routine. It must also inform the device that it has recognized the interrupt request. This can be accomplished in two ways: Some processors have an explicit interrupt-acknowledge control signal for this purpose. In other cases, the data transfer that takes place between the device and the processor can be used to inform the device.
398
Interrupts (contd..) Interrupt-requests interrupt the execution of a program, and may alter the intended sequence of events: Sometimes such alterations may be undesirable, and must not be allowed. For example, the processor may not want to be interrupted by the same device while executing its interrupt-service routine. Processors generally provide the ability to enable and disable such interruptions as desired. One simple way is to provide machine instructions such as Interrupt-enable and Interrupt-disable for this purpose. To avoid interruption by the same device during the execution of an interrupt service routine: First instruction of an interrupt service routine can be Interrupt-disable. Last instruction of an interrupt service routine can be Interrupt-enable.
399
Interrupts (contd..) Multiple I/O devices may be connected to the processor and the memory via a bus, and some or all of these devices may be capable of generating interrupt requests. Each device operates independently, so no definite order can be imposed on how the devices generate interrupt requests. How does the processor know which device has generated an interrupt? How does the processor know which interrupt-service routine needs to be executed? When the processor is executing the interrupt-service routine for one device, can another device interrupt the processor? If two interrupt requests are received simultaneously, how is the tie broken?
400
Interrupts (contd..) Consider a simple arrangement where all devices send their interrupt requests over a single control line in the bus. When the processor receives an interrupt request over this control line, how does it know which device is requesting an interrupt? This information is available in the status register of the device requesting an interrupt: the status register of each device has an IRQ bit which it sets to 1 when it requests an interrupt. The interrupt-service routine can poll the I/O devices connected to the bus; the first device with IRQ equal to 1 is the one that is serviced. The polling mechanism is simple, but querying the status bits of all the I/O devices connected to the bus is time-consuming.
401
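The polling scheme can be sketched as a loop over the devices' status registers (modelled here as dicts with a hypothetical "IRQ" key; the function name is illustrative):

```python
def find_interrupting_device(devices):
    """Poll the IRQ bit in each device's status register; the first
    device found with IRQ = 1 is the one that gets serviced."""
    for i, status in enumerate(devices):
        if status.get("IRQ") == 1:
            return i
    return None    # spurious interrupt: no device is requesting service

# Devices 2 and 3 are both requesting; polling order decides the tie,
# so device 2 is serviced first.
devices = [{"IRQ": 0}, {"IRQ": 0}, {"IRQ": 1}, {"IRQ": 1}]
serviced = find_interrupting_device(devices)
```

Note that polling order imposes a fixed priority: devices earlier in the list always win ties.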
Interrupts (contd..) The device requesting an interrupt may identify itself directly to the processor by sending a special code (4 to 8 bits) to the processor over the bus. The code supplied by the device may represent a part of the starting address of the interrupt-service routine; the remainder of the starting address is obtained by the processor from other information, such as the range of memory addresses where interrupt-service routines are located. Usually the location pointed to by the interrupting device is used to store the starting address of the interrupt-service routine.
402
CSE243: Introduction to Computer Architecture and Hardware/Software Interface
403
Interrupts (contd..) Multiple I/O devices may be connected to the processor and the memory via a bus. Some or all of these devices may be capable of generating interrupt requests. Each device operates independently, and hence no definite order can be imposed on how the devices generate interrupt requests? How does the processor know which device has generated an interrupt? How does the processor know which interrupt service routine needs to be executed? When the processor is executing an interrupt service routine for one device, can other device interrupt the processor? If two interrupt-requests are received simultaneously, then how to break the tie? We have seen how one device can alert the processor using an interrupt to indicate that it is ready for data transfer. Repeat the operation of the interrupt. Usually, there are multiple devices that may be connected to the processor and the memory via a bus. When any one of these devices become ready for data transfer they can generate interrupt requests. Now, each device is capable of generating its own interrupt request, and each one of these devices operate independently. That is, they do not consider what the other devices are doing before generating an interrupt request. As a result, there can be no definite order imposed on how the devices generate interrupt requests. When multiple devices are connected to the processor and the memory as is usually the case, and some or all of these devices are capable of generating interrupt requests, there are several questions that need to be answered. How does the processor know which device has generated the interrupt? The next question is how does the processor know which ISR needs to be executed? The other question is that when the processor is servicing the ISR for one device, can it be interrupted by another device? If two or more interrupt requests are received simultaneously, then how does the processor break the tie between which ISR it is going to execute?
404
Interrupts (contd..) Consider a simple arrangement where all devices send their interrupt requests over a single control line in the bus. When the processor receives an interrupt request over this line, how does it know which device is requesting an interrupt? The information is available in the status register of the requesting device: the status register of each device has an IRQ bit, which the device sets to 1 when it requests an interrupt. The interrupt-service routine can poll the I/O devices connected to the bus; the first device found with IRQ equal to 1 is the one that is serviced. The polling mechanism is easy to implement, but time consuming, because the status bits of all the I/O devices connected to the bus may have to be queried.
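As a rough illustration, the polling loop inside the interrupt-service routine can be sketched in C. The device status snapshot, the IRQ bit position, and the device count are all hypothetical; real polling would read memory-mapped device registers.

```c
#include <stddef.h>

#define IRQ_BIT 0x01   /* assumed position of the IRQ bit in the status register */

/* Poll the status registers in bus order; the first device with its
   IRQ bit set to 1 is the one that gets serviced. Returns the device
   index, or -1 if no device is requesting an interrupt. */
int find_interrupting_device(const unsigned char *status, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (status[i] & IRQ_BIT)
            return (int)i;
    return -1;   /* spurious interrupt: no IRQ bit set */
}
```

The bus order of the loop is what fixes the priority: a requesting device earlier in the table always wins.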
405
Interrupts (contd..) To avoid the overhead of polling, the device requesting an interrupt may identify itself directly to the processor. The device can do so by sending a special code (4 to 8 bits) to the processor over the bus. The code supplied by the device may represent a part of the starting address of the interrupt-service routine; the remainder of the starting address is obtained by the processor from other information, such as the range of memory addresses where interrupt-service routines are located. Usually the location pointed to by the interrupting device is used to store the starting address of the interrupt-service routine.
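A sketch of how the device's code can form part of the ISR starting address. The table base address and the 4-byte entry size are assumptions made for illustration, not values from the text:

```c
/* Hypothetical layout: the starting addresses of all ISRs are stored in
   a table at a fixed base address known to the processor, and the code
   supplied by the device selects a 4-byte entry within that table. */
#define VECTOR_TABLE_BASE 0x1000u   /* assumed base address of the table */

/* Combine the fixed base with the device-supplied code to locate the
   table entry that holds the ISR's starting address. */
unsigned vector_entry_address(unsigned char device_code)
{
    return VECTOR_TABLE_BASE + ((unsigned)device_code << 2);
}
```

Here the processor contributes the high-order part of the address (the table base) and the device contributes the low-order part (its code), which is the division of labor the slide describes.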
406
Interrupts (contd..) Previously, before the processor started executing the interrupt-service routine for a device, it disabled interrupts from that device. In general, the same arrangement is used when multiple devices can send interrupt requests: during the execution of one device's interrupt-service routine, the processor does not accept interrupt requests from any other device. Since interrupt-service routines are usually short, the delay this causes is generally acceptable. However, for certain time-critical devices this delay may not be acceptable. This raises the question: which devices should be allowed to interrupt the processor while it is executing the interrupt-service routine of another device?
407
Interrupts (contd..) I/O devices are organized in a priority structure: an interrupt request from a high-priority device is accepted while the processor is executing the interrupt-service routine of a low-priority device. To implement this scheme, a priority level is assigned to the processor, and this level can be changed under program control. The priority level of the processor is the priority of the program that is currently being executed. When the processor starts executing the interrupt-service routine of a device, its priority is raised to that of the device. A device sending an interrupt request is allowed to interrupt the processor only if its priority is higher than the processor's current priority.
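The acceptance rule fits in a single comparison. The strict inequality reflects the rule above: a device whose priority merely equals the processor's current priority is not allowed to interrupt.

```c
/* Returns 1 if the interrupt request is accepted, 0 otherwise.
   While an ISR runs, processor_priority has been raised to the
   priority of the device being serviced. */
int accept_interrupt(int device_priority, int processor_priority)
{
    return device_priority > processor_priority;
}
```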
408
Interrupts (contd..) The processor's priority is encoded in a few bits of the processor status register. It can be changed by instructions that write into the processor status register. Usually, these are privileged instructions, that is, instructions that can be executed only in the supervisor mode. Privileged instructions cannot be executed in the user mode; this prevents a user program from accidentally or intentionally changing the priority of the processor. An attempt to execute a privileged instruction in the user mode causes a special type of interrupt called a privilege exception.
409
Interrupts (contd..) [Figure: devices 1 through p connected to the processor, each with its own interrupt-request (INTR) and interrupt-acknowledge (INTA) line, feeding a priority arbitration circuit.] Each device has a separate interrupt-request and interrupt-acknowledge line, and each interrupt-request line is assigned a different priority level. Interrupt requests received over these lines are sent to a priority arbitration circuit in the processor. A request is accepted only if it has a higher priority level than the priority currently assigned to the processor.
410
Interrupts (contd..) Which interrupt request does the processor accept if it receives requests from two or more devices simultaneously? If the I/O devices are organized in a priority structure, the processor accepts the request from the device with the higher priority; each device has its own interrupt-request and interrupt-acknowledge line, and a different priority level is assigned to each request line. However, if the devices share an interrupt-request line, how does the processor decide which request to accept?
411
Interrupts (contd..) Polling scheme:
If the processor polls the status registers of the I/O devices to determine which device is requesting an interrupt, the priority is determined by the order in which the devices are polled. The first device found with its status bit set to 1 is the device whose interrupt request is accepted. Daisy chain scheme: [Figure: devices 1 through n share a common interrupt-request (INTR) line; the interrupt-acknowledge (INTA) line passes from the processor through device 1, device 2, and so on to device n.] Devices are connected to form a daisy chain: they share the interrupt-request line, and the interrupt-acknowledge line is threaded through the devices in sequence. When a device raises an interrupt request, the interrupt-request line is activated, and the processor responds by activating interrupt-acknowledge. This signal is received by device 1; if device 1 does not need service, it passes the signal on to device 2, and so on. The device that is electrically closest to the processor therefore has the highest priority.
412
Interrupts (contd..) When I/O devices are organized in a priority structure, each device has its own interrupt-request and interrupt-acknowledge line. When they are organized in a daisy chain, the devices share an interrupt-request line, and the interrupt-acknowledge signal propagates through the devices. A combination of the priority structure and the daisy chain scheme can also be used. [Figure: several daisy-chained groups of device circuits, each group with its own INTR/INTA pair, connected to a priority arbitration circuit in the processor.] Devices are organized into groups, and each group is assigned a different priority level. All the devices within a single group share an interrupt-request line and are connected to form a daisy chain.
413
Interrupts (contd..) Only those devices that are being used in a program should be allowed to generate interrupt requests. To control which devices may do so, the interface circuit of each I/O device has an interrupt-enable bit: if this bit is set to 1, the device is allowed to generate an interrupt request. Thus the interrupt-enable bit in the device's interface circuit determines whether the device may generate an interrupt request, while the interrupt-enable bit in the processor status register, or the priority structure of the interrupts, determines whether a given request will be accepted.
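The two levels of control combine as a simple AND. A minimal sketch, with flag names made up for illustration:

```c
/* A request reaches the processor only if the device's own interrupt-enable
   bit allows it to be raised, and it is accepted only if the processor-side
   enable (or the priority structure) allows it. */
int interrupt_serviced(int device_enable, int device_request, int processor_enable)
{
    return device_enable && device_request && processor_enable;
}
```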
414
Exceptions So far we have considered interrupts caused by requests from I/O devices. Interrupts can also be used in many other situations where the execution of one program needs to be suspended and the execution of another needs to be started. In general, the term exception refers to any event that causes an interruption; an interrupt request from an I/O device is one type of exception. Other types of exceptions arise from: recovery from errors, debugging, and privilege violations.
415
Exceptions (contd..) There are many sources of errors in a processor. For example, there may be an error in the stored data or instruction (the machine code of an instruction may be corrupted), or an error may occur during the execution of an instruction (such as dividing a number by zero). When such an error is detected, exception processing is initiated. The processor takes the same steps as for an I/O interrupt request: it suspends the execution of the current program and starts executing an exception-service routine. There is a difference, however, between handling an I/O interrupt request and handling an exception due to an error: for an I/O interrupt request, the processor usually completes the instruction in progress before branching to the interrupt-service routine, whereas in exception processing the error occurs during the execution of the instruction in progress, so that instruction usually cannot be completed.
416
Exceptions (contd..) A debugger uses exceptions to provide two important features: trace and breakpoints. In trace mode, an exception occurs after the execution of every instruction, and the debugging program is used as the exception-service routine; this allows the contents of registers and variables to be examined after each instruction. With breakpoints, an exception occurs only at specific points selected by the user; again, the debugging program serves as the exception-service routine.
417
Exceptions (contd..) Certain instructions, called privileged instructions, can be executed only when the processor is in the supervisor mode. If an attempt is made to execute a privileged instruction in the user mode, a privilege exception occurs. A privilege exception causes the processor to switch to the supervisor mode and to execute an appropriate exception-service routine.
418
Direct Memory Access Program-controlled I/O:
The processor polls a status flag in the device interface, which carries the overhead of repeated polling. I/O using interrupts: the processor waits for the device to send an interrupt request, but there is overhead associated with saving and restoring the program counter (PC) and other state information before the interrupt-service routine can run. In addition, if a group of words is to be transferred, the processor must execute instructions that increment the memory address and keep track of the word count. To transfer large blocks of data at high speed without this per-word processor intervention, an alternative approach is used.
419
Direct Memory Access (contd..)
Direct Memory Access (DMA): a special control unit may be provided to transfer a block of data directly between an I/O device and the main memory, without continuous intervention by the processor. The control unit that performs these transfers is part of the I/O device's interface circuit and is called a DMA controller. The DMA controller performs functions that would normally be carried out by the processor: for each word it provides the memory address and all the control signals, and when transferring a block of data it increments the memory address and keeps track of the number of transfers.
420
Direct Memory Access (contd..)
A DMA controller can transfer a block of data between an external device and the main memory without any intervention from the processor. However, its operation must be under the control of a program executed by the processor: the DMA controller cannot decide on its own when a transfer should take place, so the processor must initiate the DMA transfer. To do so, the processor informs the DMA controller of: the starting address in memory, the number of words in the block, and the direction of transfer (I/O device to memory, or memory to I/O device). After initiating the transfer, the processor suspends the program that requested it (this program is said to be in the blocked state) and continues executing some other program. Once the DMA controller completes the transfer, it informs the processor by raising an interrupt.
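The three items the processor supplies map directly onto the controller's registers. A sketch of the initiation step, assuming an illustrative register layout and bit positions (the struct and the macro names are hypothetical, not the actual hardware map):

```c
/* Illustrative model of the DMA controller's register set. */
struct dma_controller {
    unsigned starting_address;  /* where in memory the block begins  */
    unsigned word_count;        /* number of words to transfer       */
    unsigned status_control;    /* R/W, Done, IE, IRQ bits live here */
};

#define DMA_RW (1u << 0)        /* assumed bit: 1 = read, 0 = write  */
#define DMA_IE (1u << 1)        /* assumed bit: interrupt when done  */

/* The processor initiates a transfer by writing the three registers. */
void dma_start(struct dma_controller *c, unsigned addr, unsigned n, int read)
{
    c->starting_address = addr;
    c->word_count = n;
    c->status_control = DMA_IE | (read ? DMA_RW : 0u);
}
```

After this write the controller proceeds on its own, and the processor is free to run another program until the completion interrupt arrives.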
421
Direct Memory Access (contd..)
[Figure: DMA controller registers — a 32-bit starting-address register, a word-count register, and a status and control register containing the R/W, Done, IE, and IRQ bits.] DMA controllers have one register to hold the starting address and one register to hold the word count. A third register, called the status and control register, contains status and control bits. R/W = 1 specifies a read operation; R/W = 0 specifies a write operation. When the transfer is complete, the Done bit is set to 1. If IE = 1, the DMA controller raises an interrupt after the transfer is complete; to raise the interrupt, it sets the IRQ bit to 1. The status register may also record other information, such as whether the transfer completed correctly or errors occurred.
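Checking completion then amounts to testing individual bits of the status and control register. A sketch with assumed bit positions, since the text does not fix them:

```c
#define DMA_DONE (1u << 31)   /* assumed position of the Done bit */
#define DMA_IRQ  (1u << 30)   /* assumed position of the IRQ bit  */

/* Returns 1 once the controller reports the transfer complete. */
int dma_transfer_done(unsigned status_control)
{
    return (status_control & DMA_DONE) != 0u;
}

/* Returns 1 if the controller is currently requesting an interrupt. */
int dma_interrupt_pending(unsigned status_control)
{
    return (status_control & DMA_IRQ) != 0u;
}
```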
422
Direct Memory Access (contd..)
[Figure: a system bus connecting the processor, main memory, a keyboard interface, a printer interface, a network interface with its own DMA controller, and a disk/DMA controller attached to two disks.] Consider a memory organization with two DMA controllers. One DMA controller is used to connect a high-speed network to the computer bus. In addition, the disk controller, which controls two disks, also has DMA capability; it provides two DMA channels and can perform two independent DMA operations, as if each disk had its own DMA controller. The registers that store the memory address, the word count, and the status and control information are duplicated, one copy for each channel.
423
CSE243: Introduction to Computer Architecture and Hardware/Software Interface
424
Bus arbitration The processor and the DMA controllers both need to initiate data transfers on the bus and access main memory. At any given time, only one device may initiate transfers; the device that is allowed to initiate transfers on the bus at any given time is called the bus master. When the current bus master relinquishes its status, another device can acquire it. The process by which the next device to become the bus master is selected, and bus mastership is transferred to it, is called bus arbitration. There are two approaches: centralized arbitration, in which a single bus arbiter performs the arbitration, and distributed arbitration, in which all devices participate in the selection of the next bus master.
425
Centralized arbitration
[Figure: the processor and DMA controllers 1 and 2 on a bus, sharing a Bus-Request (BR) line and a Bus-Busy (BBSY) line, with the Bus-Grant lines BG1 and BG2 daisy-chained through the controllers.] The bus arbiter may be the processor or a separate unit connected to the bus. Normally, the processor is the bus master, unless it grants bus mastership to one of the DMA controllers. A DMA controller requests control of the bus by asserting the Bus-Request (BR) line. In response, the processor activates the Bus-Grant (BG1) line, indicating that the controller may use the bus when it is free. The BG1 signal is connected to all DMA controllers in a daisy chain fashion. While the BBSY signal is 0, the bus is busy; when BBSY becomes 1, the DMA controller that asserted BR can acquire control of the bus.
426
Centralized arbitration (contd..)
[Timing diagram: the BR, BG1, BG2, and BBSY signals over time, with the bus master changing from the processor to DMA controller 2.] The sequence is: DMA controller 2 asserts the BR signal; the processor asserts the BG1 signal; the BG1 signal propagates through DMA controller 1 to DMA controller 2; the processor relinquishes control of the bus by setting BBSY to 1, and DMA controller 2 becomes the bus master.
427
Centralized arbitration (contd..)
A centralized arbitration scheme may use one Bus-Request (BR) line and one Bus-Grant (BG) line forming a daisy chain. Several pairs of BR and BG lines are also possible, perhaps one per device, as in the case of interrupts. The bus arbiter has to ensure that only one request is granted at any given time. It may do so according to a fixed priority scheme or a rotating priority scheme. In a rotating priority scheme with four devices, the initial priority order is 1, 2, 3, 4; after the request from device 1 is granted, the priority order changes to 2, 3, 4, 1.
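The rotating priority rule can be sketched as a search that starts just past the most recently granted device. This is a software simulation of the arbiter's decision, with devices numbered from 0:

```c
/* req[i] is nonzero if device i has a pending bus request.
   After a grant to device last_granted, the arbiter scans devices
   (last_granted + 1) mod n, (last_granted + 2) mod n, ... and grants
   the first requester it finds. Returns -1 if no device is requesting. */
int rotating_grant(const int *req, int n, int last_granted)
{
    for (int i = 1; i <= n; i++) {
        int d = (last_granted + i) % n;
        if (req[d])
            return d;
    }
    return -1;
}
```

Because the scan origin moves after every grant, each device eventually reaches the head of the order, which is what distinguishes this scheme from fixed priority.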
428
Distributed arbitration
All devices waiting to use the bus share the responsibility of carrying out the arbitration process. Since the arbitration process does not depend on a central arbiter, distributed arbitration has higher reliability. Each device is assigned a 4-bit ID number, and all the devices are connected using 5 lines: 4 arbitration lines to transmit the ID, and one line for the Start-Arbitration signal. To request the bus, a device asserts the Start-Arbitration signal and places its 4-bit ID number on the arbitration lines. The pattern that appears on the arbitration lines is the logical OR of all the 4-bit device IDs placed on them.
429
Distributed arbitration (contd..)
Suppose device A has the ID 5 and wants to request the bus: it transmits the pattern 0101 on the arbitration lines. Device B has the ID 6 and also wants the bus: it transmits the pattern 0110. The pattern that appears on the arbitration lines is the logical OR of the two patterns: 0111. Arbitration process: each device compares the pattern on the arbitration lines to its own ID, starting with the most significant bit. If it detects a difference, it transmits 0s on the arbitration lines for that bit position and all lower bit positions. Device A compares its pattern 0101 to the pattern 0111 on the lines. It detects a difference at bit position 1, so it transmits the pattern 0100 on the arbitration lines. The pattern on the lines then becomes the logical OR of 0100 and 0110, which is 0110. This pattern is the same as device B's ID, so B has won the arbitration.
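The winner-selection rule can be simulated in software. This sketch models the open-collector arbitration lines as the OR of whatever each device is driving, and applies the release rule repeatedly until the pattern is stable; it is a model of the behavior, not of the actual circuitry.

```c
#define ARB_BITS 4

/* Simulate distributed arbitration for n competing devices (n <= 8).
   Each device starts by driving its own 4-bit ID. A device that sees a 1
   on a line where its ID has a 0 stops driving that bit and all lower
   bits. Returns the final stable pattern, which equals the winner's ID. */
unsigned arbitration_winner(const unsigned *id, int n)
{
    unsigned drive[8];
    for (int i = 0; i < n; i++)
        drive[i] = id[i] & 0xFu;

    for (int changed = 1; changed; ) {
        changed = 0;
        unsigned bus = 0;
        for (int i = 0; i < n; i++)   /* OR of all driven patterns */
            bus |= drive[i];
        for (int i = 0; i < n; i++) {
            for (int b = ARB_BITS - 1; b >= 0; b--) {
                if (((bus >> b) & 1u) && !((id[i] >> b) & 1u)) {
                    /* first mismatch: keep only the bits above position b */
                    unsigned keep = id[i] & ~((1u << (b + 1)) - 1u) & 0xFu;
                    if (drive[i] != keep) { drive[i] = keep; changed = 1; }
                    break;
                }
            }
        }
    }

    unsigned bus = 0;
    for (int i = 0; i < n; i++)
        bus |= drive[i];
    return bus;
}
```

Running it on the example above, IDs 5 (0101) and 6 (0110), reproduces the outcome in the text: device A releases down to 0100, the lines settle at 0110, and B wins.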
430
Buses Processor, main memory, and I/O devices are interconnected by means of a bus. Bus provides a communication path for the transfer of data. Bus also includes lines to support interrupts and arbitration. A bus protocol is the set of rules that govern the behavior of various devices connected to the bus, as to when to place information on the bus, when to assert control signals, etc.
431
Buses (contd..) Bus lines may be grouped into three types: data, address, and control. Control signals specify: whether it is a read or a write operation; the required size of the data, when several operand sizes (byte, word, long word) are possible; and timing information to indicate when the processor and I/O devices may place data on or receive data from the bus. Schemes for timing data transfers over a bus can be classified as synchronous or asynchronous. Recall that one device plays the role of a master: the device that initiates a data transfer on the bus by issuing read or write control signals is called the master, and the device that is being addressed by the master is called the slave, or target.
432
Synchronous bus All devices derive timing information from a common clock line. The clock line carries equally spaced pulses which define equal time intervals. In a simple synchronous bus, each of these pulses constitutes a bus cycle, and one data transfer can take place during one bus cycle. [Timing diagram: the bus clock waveform, with one period marked as a bus cycle.]
433
Synchronous bus (contd..)
[Timing diagram: one bus cycle, showing the bus clock, the address and command appearing at time t0, and the data appearing between t1 and t2.] At time t0 the master places the device address and command on the bus, indicating that it is a Read operation. The addressed slave places the data on the data lines at time t1. At time t2 the master "strobes" the data on the data lines into its input buffer. In the case of a Write operation, the master places the data on the bus along with the address and command at time t0, and the slave strobes the data into its input buffer at time t2.
434
Synchronous bus (contd..)
Once the master places the device address and command on the bus, it takes time for this information to propagate to the devices. This time depends on the physical and electrical characteristics of the bus. All the devices must also be given enough time to decode the address and control signals, so that the addressed slave can place data on the bus. The width of the interval t1 - t0 therefore depends on: the maximum propagation delay between two devices connected to the bus, plus the time taken by all the devices to decode the address and control signals, so that the addressed slave can respond at time t1.
435
Synchronous bus (contd..)
At the end of the clock cycle, at time t2, the master strobes the data on the data lines into its input buffer if it is a Read operation. To "strobe" means to capture the values of the data and store them in a buffer. When data are to be loaded into a storage buffer register, the data must be available for a period longer than the setup time of the device. The width of the interval t2 - t1 must therefore be longer than: the maximum propagation time of the bus plus the setup time of the input buffer register of the master.
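Putting the two constraints together gives a lower bound on the bus cycle. The arithmetic can be sketched as below; the numerical values in the test are made-up illustrative figures, not timings from the text:

```c
/* Minimum bus cycle (in ns) for the simple synchronous scheme:
   t1 - t0 must cover bus propagation plus address decoding, and
   t2 - t1 must cover bus propagation plus the input buffer's setup time. */
int min_bus_cycle_ns(int propagation_ns, int decode_ns, int setup_ns)
{
    int t1_minus_t0 = propagation_ns + decode_ns;
    int t2_minus_t1 = propagation_ns + setup_ns;
    return t1_minus_t0 + t2_minus_t1;
}
```

Note that the propagation delay is paid twice per cycle, once in each direction, which is one reason the simple scheme runs at the speed of the slowest device.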
436
Synchronous bus (contd..)
[Timing diagram: the same bus cycle as seen by the master and by the slave. The address and command appear on the bus at t0 and reach the slave after a delay tAS; the data appear on the bus at tDS and reach the master at tDM.] Signals do not appear on the bus at the instant they are placed on it, due to the propagation delay in the interface circuits. They reach the devices after a propagation delay which depends on the characteristics of the bus. The data must remain on the bus for some time after t2, equal to the hold time of the buffer.
437
Synchronous bus (contd..)
The data transfer has to be completed within one clock cycle. The clock period t2 - t0 must accommodate the longest propagation delay on the bus and the slowest device interface. This forces all the devices to operate at the speed of the slowest device. Moreover, the processor simply assumes that the data are available at t2 in the case of a Read operation, or have been read by the device in the case of a Write operation. What if the device has actually failed, and never really responded?
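The clock-period constraint above can be made concrete with a small sketch. This is an illustrative model with made-up delay numbers, not a real bus specification: the period must cover propagation to the slave, the slowest device's decode time, propagation back, and the master's setup time.

```python
# Sketch: minimum clock period of a synchronous bus (illustrative numbers).
# The period t2 - t0 must cover the two pulse widths described above:
# (propagation + slowest decode) for t1 - t0, and (propagation + setup)
# for t2 - t1.

def min_clock_period_ns(max_prop_delay, decode_times, setup_time):
    """All arguments in nanoseconds; decode_times has one entry per device."""
    # The bus must wait for the slowest device to decode address/command,
    # then for data to propagate back and satisfy the master's setup time.
    return max_prop_delay + max(decode_times) + max_prop_delay + setup_time

# Hypothetical delays: 10 ns propagation, devices decode in 5-40 ns, 5 ns setup.
period = min_clock_period_ns(10, [5, 12, 40], 5)
print(period)  # the slowest device (40 ns) dictates the period: 65
```

Note how a single slow device (40 ns decode) sets the period for every transfer, which is exactly the drawback the slide describes.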
Synchronous bus (contd..)
Most buses have control signals that carry a response from the slave. These control signals serve two purposes: they inform the master that the slave has recognized the address and is ready to participate in a data transfer operation, and they allow the duration of the data transfer to be adjusted to the speed of the participating slave. A high-frequency bus clock is used, and a data transfer then spans several clock cycles instead of just one as in the earlier case.
Synchronous bus (contd..)
[Timing diagram: clock, address, command, data and Slave-ready over cycles 1-4.] The address and a command requesting a Read operation appear on the bus in cycle 1. The slave places the data on the bus and asserts the Slave-ready signal; the master then strobes the data into its input buffer. Slave-ready is an acknowledgement from the slave to the master, confirming that valid data have been sent. Depending on when Slave-ready is asserted, the duration of the data transfer can change. Clock changes are seen by all the devices at the same time.
Synchronous bus (contd..)
Clock signal used on the bus is not necessarily the same as the processor clock. Processor clock is much faster than the bus clock. Modern processor clocks are typically above 500 MHz. Clock frequencies used on the memory and I/O buses may be in the range of 50 to 150 MHz.
Asynchronous bus
Data transfers on the bus are controlled by a handshake between the master and the slave. The common clock of the synchronous bus is replaced by two timing control lines: Master-ready and Slave-ready. The Master-ready signal is asserted by the master to indicate to the slaves that it is ready to participate in a data transfer. The Slave-ready signal is asserted by the slave in response to Master-ready, and it indicates to the master that the slave is ready to participate in a data transfer.
Asynchronous bus (contd..)
Data transfer using the handshake protocol: The master places the address and command information on the bus, and asserts the Master-ready signal to indicate that this information is on the bus. All devices on the bus decode the address. The addressed slave performs the required operation, and informs the master it has done so by asserting the Slave-ready signal. Once Slave-ready is asserted, the master removes all of its signals from the bus. If the operation is a Read operation, the master also strobes the data into its input buffer.
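The handshake sequence above can be sketched as an ordered list of events. This is a purely illustrative model (the actor/event names are made up, not real bus signals beyond those named in the text):

```python
# Sketch of the asynchronous full-handshake sequence for a Read operation,
# modelled as an ordered list of (actor, event) pairs.

def handshake_read(address):
    events = []
    events.append(("master", f"place address/command {address:#x}"))
    events.append(("master", "assert Master-ready"))
    events.append(("slave", "decode address, place data on bus"))
    events.append(("slave", "assert Slave-ready"))
    events.append(("master", "strobe data, deassert Master-ready"))
    events.append(("slave", "remove data, deassert Slave-ready"))
    return events

for who, what in handshake_read(0x40):
    print(who, "->", what)
```

The alternation master/slave/master/slave is the essence of the protocol: each side waits for the other's signal transition rather than for a clock edge.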
Asynchronous bus (contd..)
[Timing diagram: address and command, Master-ready, Slave-ready and data over the bus cycle t0-t5.] t0 - Master places the address and command information on the bus. The command indicates a Read operation, that is, data are to be transferred from the addressed device to the master.
Asynchronous bus (contd..)
[Timing diagram: address and command, Master-ready, Slave-ready and data over the bus cycle t0-t5.] t1 - Master asserts the Master-ready signal. It is asserted at t1 instead of t0 to allow for bus skew. Bus skew occurs when two signals transmitted simultaneously reach the destination at different times, because different bus lines may have different propagation speeds. t1 - t0 should be greater than the maximum skew, and should also include the time taken by the devices to decode the address information.
Asynchronous bus (contd..)
[Timing diagram: address and command, Master-ready, Slave-ready and data over the bus cycle t0-t5.] t2 - Addressed slave places the data on the bus and asserts the Slave-ready signal. The period t2 - t1 depends on the propagation delay between the master and the slave, and on the delay in the slave's interface circuit.
Asynchronous bus (contd..)
[Timing diagram: address and command, Master-ready, Slave-ready and data over the bus cycle t0-t5.] t3 - Slave-ready signal arrives at the master. Slave-ready was placed on the bus at the same time as the data, so the data may also experience bus skew. The master should wait for the maximum bus skew plus the setup time of its input buffer, and then strobe the data. It also deactivates the Master-ready signal to indicate that it has received the data.
Asynchronous bus (contd..)
[Timing diagram: address and command, Master-ready, Slave-ready and data over the bus cycle t0-t5.] t4 - Master removes the address and command information. t4 - t3 allows for bus skew: once the Master-ready signal is set to 0, it should reach all the devices before the address and command information is removed from the bus.
Asynchronous bus (contd..)
[Timing diagram: address and command, Master-ready, Slave-ready and data over the bus cycle t0-t5.] t5 - Slave sees the transition of the Master-ready signal from 1 to 0, and removes the data and the Slave-ready signal from the bus.
Asynchronous vs. Synchronous bus
Advantages of asynchronous bus: It eliminates the need for a common clock shared by the sender and the receiver, and it accommodates varying device delays automatically via the Slave-ready signal. Disadvantages of asynchronous bus: the data transfer rate with a full handshake is limited by two round-trip delays per transfer. A data transfer on a synchronous bus involves only one round-trip delay, so a synchronous bus can achieve faster rates.
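The one- versus two-round-trip difference can be quantified with a toy model. The numbers and the helper below are illustrative assumptions, not measurements of any real bus:

```python
# Sketch comparing transfer rates: a full handshake costs two round trips
# per transfer; a synchronous transfer costs roughly one (illustrative model).

def rate_mbps(bytes_per_transfer, round_trip_ns, round_trips):
    # bytes per nanosecond, scaled: 1 byte/ns = 1000 MB/s.
    return bytes_per_transfer * 1000 / (round_trip_ns * round_trips)

sync_rate  = rate_mbps(4, 50, 1)   # one round-trip delay per transfer
async_rate = rate_mbps(4, 50, 2)   # two round-trip delays (full handshake)
print(sync_rate, async_rate)       # 80.0 vs 40.0 MB/s
```

Under identical physical delays, the handshake halves the peak rate, which is the trade-off the slide states.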
CSE243: Introduction to Computer Architecture and Hardware/Software Interface
Interface circuits
An I/O interface consists of the circuitry required to connect an I/O device to a computer bus. The side of the interface that connects to the computer bus has signals for address, data and control. The side that connects to the I/O device has a datapath and associated controls to transfer data between the interface and the device; this side is called a "port". Ports can be classified into two types: parallel ports and serial ports.
Interface circuits (contd..)
A parallel port transfers data in the form of a number of bits, normally 8 or 16, to or from the device. A serial port transfers data one bit at a time. The processor communicates with the bus in the same way whether the port is parallel or serial; the conversion between parallel and serial formats takes place inside the interface circuit. A parallel port uses a multiple-pin connector between the device and the processor, and a cable with as many wires as the number of bits transferred simultaneously; it is suitable for devices that are physically close to the computer. A serial port needs only a few wires between the device and the processor, which is useful for devices at a longer distance.
Parallel port
[Figure: keyboard connected to a processor through an input interface containing a DATAIN register and a SIN status flag; the keyboard side has keyboard switches, an encoder and debouncing circuit, and a Valid line.] The keyboard is connected to the processor using a parallel port. The processor is 32 bits, uses memory-mapped I/O, and uses the asynchronous bus protocol. On the processor side of the interface we have: data lines, address lines, a control or R/W line, the Master-ready signal and the Slave-ready signal.
Parallel port (contd..)
[Figure: keyboard-side connections of the input interface.] On the keyboard side of the interface: an encoder circuit which generates a code for the key pressed; a debouncing circuit which eliminates the effect of key bounce (a single keystroke may otherwise appear as multiple events to the processor); data lines carrying the code for the key; and a Valid line which changes from 0 to 1 when a key is pressed. This causes the code to be loaded into DATAIN and SIN to be set to 1.
Parallel port (contd..)
[Figure: input interface logic - DATAIN, status flag circuit, address decoder for A1 through A31, and three-state drivers onto data lines D7-D0.] The output lines of DATAIN are connected to the data lines of the bus using three-state drivers, which are turned on when the processor issues a Read signal and the address selects this register. The status flag is generated by a status flag circuit and is connected to line D0 of the processor bus through a three-state driver. An address decoder selects the input interface based on bits A1 through A31; bit A0 determines whether the status register or the data register is to be read when Master-ready is active. In response, the interface activates the Slave-ready signal when either Read-status or Read-data is equal to 1, which depends on line A0.
Parallel port (contd..)
[Figure: printer connected to a processor through an output interface containing a DATAOUT register and an SOUT flag; the printer side has Valid and Idle lines.] The printer is connected to the processor using a parallel port. The processor is 32 bits, uses memory-mapped I/O, and uses the asynchronous bus protocol. On the processor side: data lines, address lines, a control or R/W line, the Master-ready signal and the Slave-ready signal.
Parallel port (contd..)
[Figure: printer-side connections of the output interface.] On the printer side: an Idle signal line which the printer asserts when it is ready to accept a character; this causes the SOUT flag to be set to 1. The processor then places a new character into the DATAOUT register. The Valid signal is asserted by the interface circuit when it places a new character on the data lines.
Parallel port (contd..)
The data lines of the processor bus are connected to the DATAOUT register of the interface. The status flag SOUT is connected to data line D1 using a three-state driver, which is turned on when the Read-status control line is 1. An address decoder selects the output interface using address lines A1 through A31; line A0 determines whether data are to be loaded into the DATAOUT register or the status flag is to be read. If the Load-data line is 1, the Valid line is set to 1. If the Idle line is 1, the status flag SOUT is set to 1.
Combined I/O interface
[Figure: combined input/output interface with DATAIN, DATAOUT and a status register; port lines PA7-PA0 and PB7-PB0, control lines CA, CB1, CB2, register-select lines RS1, RS0, My-address, and handshake control.] Address bits A2 through A31, that is, 30 bits, select the overall interface. Address bits A1 and A0 select one of three registers: DATAIN, DATAOUT, and the status register. The status register contains the flags SIN and SOUT in bits 0 and 1. Data lines PA0 through PA7 connect the input device to the DATAIN register; the DATAOUT register connects the data lines on the processor bus to lines PB0 through PB7, which connect to the output device. There are separate input and output data lines for connection to an I/O device.
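The SIN/SOUT layout in the status register (bit 0 and bit 1, as described above) amounts to simple bit extraction. A minimal sketch; the helper name is hypothetical, not part of any real driver API:

```python
# Sketch: extracting SIN (bit 0) and SOUT (bit 1) from the combined
# interface's status register, per the bit layout described above.
SIN_BIT, SOUT_BIT = 0, 1

def flags(status_byte):
    sin = (status_byte >> SIN_BIT) & 1    # 1 = input data available in DATAIN
    sout = (status_byte >> SOUT_BIT) & 1  # 1 = DATAOUT ready for a new byte
    return sin, sout

print(flags(0b10))  # (0, 1): no input pending, output buffer ready
```

A polling loop would read the status register and test these bits before touching DATAIN or DATAOUT.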
Data lines to I/O device are bidirectional.
[Figure: bidirectional parallel interface with DATAIN, DATAOUT, a Data Direction Register, a status register, register-select lines RS0-RS2, control lines C1 and C2, and Ready/Accept handshake lines.] Data lines P7 through P0 to the I/O device are bidirectional: each can be used for input or output, and some lines can be used for input and some for output depending on the pattern in the Data Direction Register (DDR). The processor places an 8-bit pattern into the DDR; if a given bit position in the DDR is 1, the corresponding data line acts as an output line, otherwise it acts as an input line. C1 and C2 control the interaction between the interface circuit and the I/O device. The Ready and Accept lines are the handshake control lines on the processor-bus side, and are connected to Master-ready and Slave-ready. The input signal My-address is connected to the output of an address decoder, and three register-select lines allow up to 8 registers in the interface to be selected.
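The DDR rule above (1 = output, 0 = input, per line) can be sketched directly. An illustrative helper, not a real device register interface:

```python
# Sketch of a Data Direction Register: a 1 in DDR bit i makes data line Pi
# an output, a 0 makes it an input, as described above.

def line_directions(ddr_byte):
    # Returns the direction of lines P0..P7, low bit first.
    return ["output" if (ddr_byte >> i) & 1 else "input" for i in range(8)]

dirs = line_directions(0b00001111)  # P0-P3 outputs, P4-P7 inputs
print(dirs)
```

Writing a new pattern into the DDR reconfigures the port without any hardware change, which is the point of making the direction programmable.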
Serial port
A serial port is used to connect the processor to I/O devices that require transmission of data one bit at a time. The serial port communicates in a bit-serial fashion on the device side and in a bit-parallel fashion on the bus side. The transformation between the parallel and serial formats is achieved with shift registers that have parallel access capability.
Serial port (contd..)
[Figure: serial interface with input and output shift registers, DATAIN and DATAOUT registers, chip and register select, status and control, and receiving/transmission clocks.] The input shift register accepts input one bit at a time from the I/O device. Once all 8 bits are received, the contents of the input shift register are loaded in parallel into the DATAIN register. Similarly, output data in the DATAOUT register are loaded into the output shift register, and bits are shifted out to the I/O device one at a time. As soon as data from the input shift register are loaded into DATAIN, the shift register can start accepting another 8 bits from the input device, before the processor has read the contents of DATAIN. This is called double-buffering.
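The double-buffered input path above can be sketched in a few lines. The class and method names are illustrative, not a real UART model; bits are assumed to arrive most-significant first:

```python
# Sketch of an 8-bit input shift register with double buffering: bits arrive
# serially; after 8 shifts the byte moves in parallel to DATAIN, freeing the
# shift register to start receiving the next byte immediately.
class SerialInput:
    def __init__(self):
        self.shift = 0      # serial-side shift register
        self.count = 0      # bits received so far
        self.datain = None  # parallel-side DATAIN buffer

    def clock_in(self, bit):
        self.shift = ((self.shift << 1) | bit) & 0xFF
        self.count += 1
        if self.count == 8:
            self.datain = self.shift      # parallel load into DATAIN
            self.shift, self.count = 0, 0  # ready for the next byte at once

port = SerialInput()
for bit in [0, 1, 0, 0, 0, 0, 0, 1]:  # MSB first: 0b01000001 = 'A'
    port.clock_in(bit)
print(chr(port.datain))  # A
```

Because the shift register resets itself on the parallel load, reception of the next byte overlaps with the processor reading DATAIN, which is exactly what double-buffering buys.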
Serial port (contd..)
Serial interfaces require fewer wires, so serial transmission is convenient for connecting devices that are physically distant from the computer. The speed of transmission of data over a serial interface is known as the "bit rate", and it depends on the nature of the devices connected. In order to accommodate devices with a range of speeds, a serial interface must be able to use a range of clock speeds. Several standard serial interfaces have been developed, for example the Universal Asynchronous Receiver Transmitter (UART) for low-speed serial devices, and RS-232-C for connection to communication links.
Standard I/O interfaces
An I/O device is connected to a computer using an interface circuit. Do we have to design a different interface for every combination of an I/O device and a computer? A practical approach is to develop standard interfaces and protocols. A personal computer has: a motherboard which houses the processor chip, main memory and some I/O interfaces, and a few connectors into which additional interfaces can be plugged. The processor bus is defined by the signals on the processor chip; devices which require a high-speed connection to the processor are connected directly to this bus.
Standard I/O interfaces (contd..)
For electrical reasons, only a few devices can be connected directly to the processor bus, so the motherboard usually provides another bus that can support more devices. The processor bus and this other bus (called the expansion bus) are interconnected by a circuit called a "bridge". Devices connected to the expansion bus experience a small delay in data transfers. The design of a processor bus is closely tied to the architecture of the processor, so no uniform standard can be defined for it; a uniform standard can, however, be defined for the expansion bus.
Standard I/O interfaces (contd..)
[Figure: processor and main memory on the processor bus, connected by a bridge to a PCI bus; the PCI bus hosts a SCSI controller (SCSI bus with two disks and a CD-ROM), an Ethernet interface, a USB controller (video, keyboard, game), and an ISA interface (IDE disk).] The bridge circuit translates signals and protocols between the processor bus and the PCI bus, which serves as the expansion bus on the motherboard.
Standard I/O interfaces (contd..)
A number of standards have been developed for the expansion bus. Some have evolved as de facto standards, for example IBM's Industry Standard Architecture (ISA). Three widely used bus standards are: PCI (Peripheral Component Interconnect), SCSI (Small Computer System Interface), and USB (Universal Serial Bus).
Basic concepts
The speed of execution of programs can be improved in two ways: by using faster circuit technology to build the processor and the memory, or by arranging the hardware so that a number of operations can be performed simultaneously. In the latter case, the number of operations performed per second is increased, although the elapsed time needed to perform any one operation is not changed. Pipelining is an effective way of organizing such concurrent activity in a computer system to improve the speed of execution of programs.
Basic concepts (contd..)
The processor executes a program by fetching and executing instructions one after the other; this is known as sequential execution. If Fi refers to the fetch step and Ei to the execution step of instruction Ii, sequential execution looks like: F1 E1 F2 E2 F3 E3 ... What if the execution of one instruction is overlapped with the fetching of the next one?
Basic concepts (contd..)
The computer has two separate hardware units, one for fetching instructions and one for executing them. An instruction is fetched by the instruction fetch unit and deposited in an interstage buffer B1. The buffer enables the execution unit to execute one instruction while the fetch unit is fetching the next. The results of execution are deposited in the destination location specified by the instruction.
Basic concepts (contd..)
The computer is controlled by a clock whose period is such that the fetch and execute steps of any instruction can each be completed in one clock cycle. First clock cycle: the fetch unit fetches instruction I1 (F1) and stores it in B1. Second clock cycle: the fetch unit fetches I2 (F2) while the execution unit executes I1 (E1). Third clock cycle: the fetch unit fetches I3 (F3) while the execution unit executes I2 (E2). Fourth clock cycle: the execution unit executes I3 (E3).
Basic concepts (contd..)
In each clock cycle, the fetch unit fetches the next instruction while the execution unit executes the current instruction stored in the interstage buffer, so both units can be kept busy all the time. If this pattern of fetch and execute can be sustained for a long time, the completion rate of instruction execution will be twice that achievable by sequential operation. The fetch and execute units constitute a two-stage pipeline, with each stage performing one step in processing an instruction. The interstage buffer holds the information that needs to be passed from the fetch stage to the execute stage, and new information is loaded into the buffer every clock cycle.
Basic concepts (contd..)
Suppose the processing of an instruction is divided into four steps: F (Fetch: read the instruction from memory), D (Decode: decode the instruction and fetch the source operands), E (Execute: perform the operation specified by the instruction), and W (Write: store the result in the destination location). There is a distinct hardware unit for each of these steps, and information is passed from one unit to the next through an interstage buffer: three interstage buffers, B1, B2 and B3, connect the four units. As an instruction progresses through the pipeline, the information needed by the downstream units is passed along.
Basic concepts (contd..)
[Pipeline chart: instructions I1-I4 flowing through stages F, D, E, W over clock cycles 1-7.] Clock cycle 1: F1. Clock cycle 2: D1, F2. Clock cycle 3: E1, D2, F3. Clock cycle 4: W1, E2, D3, F4. Clock cycle 5: W2, E3, D4. Clock cycle 6: W3, E4. Clock cycle 7: W4.
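The ideal schedule above follows a simple rule: instruction i enters stage s in cycle i + s. A small sketch (illustrative, 1-indexed cycles) that generates the same table:

```python
# Sketch: ideal 4-stage pipeline schedule (F, D, E, W) with no hazards.
# Instruction i (1-indexed) occupies stage s (0-indexed) in cycle i + s.

def schedule(n_instructions, stages="FDEW"):
    table = {}
    for i in range(1, n_instructions + 1):
        for s, stage in enumerate(stages):
            table.setdefault(i + s, []).append(f"{stage}{i}")
    return table

for cycle, steps in sorted(schedule(4).items()):
    print(cycle, steps)
```

Cycle 4 is the first "full" cycle, with all four stages busy (W1, E2, D3, F4), matching the chart.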
Basic concepts (contd..)
[Pipeline chart: instructions I1-I4 in stages F, D, E, W over clock cycles 1-7.] During clock cycle 4: Buffer B1 holds instruction I3, which was fetched in cycle 3 and is being decoded by the instruction-decoding unit. Buffer B2 holds the source and destination operands for instruction I2, together with the information needed for its Write step (W2); this information will be passed to stage W in the following clock cycle. Buffer B3 holds the results produced by the execution unit and the destination information for instruction I1.
Role of cache memory
Each stage in the pipeline is expected to complete its operation in one clock cycle, so the clock period must be long enough for the longest task; units which complete their tasks early remain idle for the rest of the clock period. For pipelining to be effective, the tasks performed in different stages should require about the same amount of time. If instructions were fetched from main memory, the instruction fetch stage could take as much as ten times longer than the other stage operations inside the processor. However, if instructions are fetched from a cache memory on the processor chip, the time required to fetch an instruction is roughly the same as the time required for the other basic operations.
Pipeline performance
The potential increase in performance achieved by pipelining is proportional to the number of pipeline stages: for example, with 4 pipeline stages, the rate of instruction processing is 4 times that of sequential execution. Pipelining does not cause a single instruction to be executed faster; it is the throughput that increases. This rate can be achieved only if the pipelined operation can be sustained without interruption throughout program execution. If it cannot, the pipeline is said to "stall", and a condition that causes the pipeline to stall is called a "hazard".
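The "speedup proportional to the number of stages" claim can be checked with a simple cycle count. A minimal sketch, assuming one clock per stage and no stalls:

```python
# Sketch: cycles to run n instructions, sequential vs. k-stage pipelined
# (one clock per stage, no hazards).

def sequential_cycles(n, k):
    return n * k                 # each instruction takes all k steps alone

def pipelined_cycles(n, k):
    return k + (n - 1)           # fill the pipeline once, then one per cycle

n, k = 1000, 4
speedup = sequential_cycles(n, k) / pipelined_cycles(n, k)
print(speedup)  # just under 4: the pipeline-fill cost becomes negligible
```

For large n the speedup approaches k, which is the proportionality the slide states; it never quite reaches k because of the initial fill cycles.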
Data hazard
Execution of the instruction occurs in the E stage of the pipeline. Most arithmetic and logic operations take only one clock cycle to execute, but some operations, such as division, take more time to complete. [Pipeline chart: the operation specified in instruction I2 takes three cycles to complete, from cycle 4 to cycle 6.]
Data hazard (contd..)
[Pipeline chart: I2's Execute step occupies cycles 4-6.] In cycles 5 and 6, the Write stage is idle because it has no data to work with. The information in buffer B2 must be retained until the execution of instruction I2 is complete, so stage 2, and by extension stage 1, cannot accept new instructions because the information in B1 cannot be overwritten: steps D4 and F5 must be postponed. A data hazard is a condition in which either the source or the destination operand is not available at the time expected in the pipeline.
Control or instruction hazard
The pipeline may also be stalled because an instruction is not available at the expected time. For example, a cache miss may occur while fetching an instruction, so the instruction has to be fetched from main memory; this takes much longer than fetching it from the cache, and the fetch step cannot be completed in one cycle. [Pipeline chart: the fetch of instruction I2 results in a cache miss, so F2 takes 4 clock cycles instead of 1.]
Control or instruction hazard (contd..)
The fetch operation for instruction I2 results in a cache miss, and the instruction fetch unit must fetch this instruction from main memory. Suppose this takes 4 clock cycles; instruction I2 is then available in buffer B1 at the end of clock cycle 5, and the pipeline resumes its normal operation at that point. The decode unit is idle in cycles 3 through 5, the execute unit is idle in cycles 4 through 6, and the write unit is idle in cycles 5 through 7. Such idle periods are called stalls or bubbles. Once created in one of the pipeline stages, a bubble moves downstream until it reaches the last unit. [Pipeline chart: the bubble created by the miss in F propagates through D, E and W.]
Structural hazard
Two instructions require the use of a hardware resource at the same time. The most common case is access to memory: one instruction needs to access memory as part of its Execute or Write stage while another instruction is being fetched. If instructions and data reside in the same cache unit, only one instruction can proceed and the other is delayed; many processors have separate data and instruction caches to avoid this delay. In general, structural hazards can be avoided by providing sufficient resources on the processor chip.
Structural hazard (contd..)
[Pipeline chart for I1-I5, where I2 is Load X(R1),R2 and has an extra memory-access step M between E and W.] The memory address X+[R1] is computed in step E2 in cycle 4, the memory access takes place in cycle 5, and the operand read from memory is written into register R2 in cycle 6. Execution of instruction I2 thus takes two clock cycles, 4 and 5. In cycle 6, both instructions I2 and I3 require access to the register file; the pipeline is stalled because the register file cannot handle two operations at once.
Pipelining and performance
Pipelining does not cause an individual instruction to be executed faster; rather, it increases the throughput, defined as the rate at which instruction execution is completed. When a hazard occurs, one of the stages in the pipeline cannot complete its operation in one clock cycle, and the pipeline stalls, causing a degradation in performance. One instruction completion per clock cycle is the upper limit for the throughput that can be achieved in a pipelined processor.
Data hazards
A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed. Consider two instructions: I1: A = 3 + A, and I2: B = 4 x A. If A = 5 and I1 and I2 are executed sequentially, B = 32. In a pipelined processor, the execution of I2 can begin before the execution of I1 completes; the value of A used in I2 would then be the original value 5, leading to an incorrect result. Instructions I1 and I2 thus depend on each other: the data used by I2 depend on the result generated by I1. The results obtained under pipelined execution must be the same as those obtained under sequential execution, so when two instructions depend on each other, they must be performed in the correct order.
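The hazard above can be reproduced with ordinary arithmetic: reading A before I1's write completes yields a stale value and a wrong B.

```python
# Sketch of the data hazard above: I1: A = 3 + A, then I2: B = 4 * A.
A = 5
correct_B = 4 * (3 + A)   # sequential order, I1's result visible to I2 -> 32
stale_B = 4 * A           # I2 reads A before I1 writes it back -> 20
print(correct_B, stale_B)
```

The pipeline must therefore either stall I2 or forward I1's result so that only the 32 outcome is ever observed.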
Data hazards (contd..)
[Pipeline chart: Mul R2,R3,R4 followed by Add R5,R4,R6; the Add's decode step is stalled (D2, D2A) until W1 completes, delaying I3 and I4 as well.] The Mul instruction places the result of the multiply operation in register R4 at the end of clock cycle 4. Register R4 is used as a source operand in the Add instruction, so the decode unit decoding the Add instruction cannot proceed until the Write step of the Mul instruction is complete. This data dependency arises because the destination of one instruction is used as a source in the next instruction.
Operand forwarding
A data hazard occurs because the destination of one instruction is used as the source in the next. Instruction I2 has to wait for the data to be written into the register file by the Write stage at the end of step W1. However, these data are available at the output of the ALU as soon as the Execute stage completes step E1. The delay can be reduced, or even eliminated, if the result of instruction I1 is forwarded directly for use in step E2. This is called "operand forwarding".
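The forwarding decision is essentially a multiplexer at each ALU input: take the just-computed result instead of the register-file value when the previous instruction's destination matches this instruction's source. A minimal sketch; the function and parameter names are illustrative, not real hardware signal names:

```python
# Sketch of the forwarding mux: bypass the register file when the previous
# instruction's destination register matches this instruction's source.

def alu_operand(src_reg, regfile, fwd_dest, fwd_value):
    if fwd_dest == src_reg:       # dependency detected: take the bypass path
        return fwd_value          # value straight off the result (RSLT) bus
    return regfile[src_reg]       # normal path through the register file

regs = {"R4": 0, "R5": 7}         # R4 not yet written back by the Mul
op = alu_operand("R4", regs, fwd_dest="R4", fwd_value=42)
print(op)  # 42, forwarded from the Mul result instead of the stale 0
```

With this bypass in place, the Add in the earlier example never sees the stale register value, and no stall cycles are needed.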
Operand forwarding (contd..)
[Figure: register file with interstage registers SRC1, SRC2 and RSLT around the ALU; forwarding paths run from the destination bus to multiplexers at the ALU inputs.] The datapath is similar to the three-bus organization, with registers SRC1, SRC2 and RSLT added as interstage buffers for pipelined operation: SRC1 and SRC2 are part of buffer B2, and RSLT is part of buffer B3. The forwarding mechanism consists of two connections from the destination bus to multiplexers at the inputs of the ALU, which allow the data on the destination bus to be selected instead of the contents of the SRC1 or SRC2 register.
Operand forwarding (contd..)
I1: Mul R2, R3, R4
I2: Add R5, R4, R6
[Same datapath figure.] Clock cycle 3: Instruction I2 is decoded, and a data dependency is detected; the operand not involved in the dependency, register R5, is loaded into register SRC1. Clock cycle 4: The product produced by I1 becomes available in register RSLT, and the forwarding connection allows it to be used in step E2. Instruction I2 thus proceeds without interruption.
Handling data dependency in software
A data dependency may be detected by the hardware while decoding the instruction: the control hardware can delay the reading of a register by an appropriate number of clock cycles, until its contents become available; the pipeline stalls for that many clock cycles. Detecting and handling data dependencies can also be done in software: the compiler can introduce the necessary delay by inserting an appropriate number of NOP instructions. For example, if a two-cycle delay is needed between two instructions, two NOPs can be placed between them: I1: Mul R2, R3, R4; NOP; NOP; I2: Add R5, R4, R6.
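The compiler pass described above can be sketched as a simple scan over the instruction list. This is an illustrative toy pass (tuple encoding and two-cycle gap are assumptions), not a real compiler:

```python
# Sketch: insert NOPs after any instruction whose destination register feeds
# the next instruction's sources (a fixed two-cycle gap is assumed).

def insert_nops(program, gap=2):
    out = []
    for i, (op, dst, srcs) in enumerate(program):
        out.append((op, dst, srcs))
        nxt = program[i + 1] if i + 1 < len(program) else None
        if nxt and dst in nxt[2]:             # dependency on our destination
            out.extend([("NOP", None, [])] * gap)
    return out

prog = [("Mul", "R4", ["R2", "R3"]), ("Add", "R6", ["R5", "R4"])]
print(insert_nops(prog))  # Mul, NOP, NOP, Add
```

The hardware approach keeps the binary compact but complicates the control logic; the software approach keeps the hardware simple at the cost of larger code, which is the trade-off behind this slide.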
Side effects
Data dependencies are explicit and easy to detect when a register specified as the destination of one instruction is used as a source in a subsequent instruction. However, some instructions also modify registers that are not specified as the destination: for example, in the autoincrement and autodecrement addressing modes, the source register is modified as well. When a location other than the one explicitly specified as the destination is affected, the instruction is said to have a "side effect". Another example is the condition code flags, which implicitly record the results of one instruction and may be used by a subsequent instruction.
Side effects (contd..)
I1: Add R3, R4
I2: AddWithCarry R2, R4
Instruction I1 sets the carry flag and instruction I2 uses it, creating an implicit dependency between the two instructions. Instructions with side effects can lead to multiple data dependencies, resulting in a significant increase in the complexity of the hardware or software needed to handle them. Side effects should therefore be kept to a minimum in instruction sets designed for execution on pipelined hardware.
Instruction hazards
The instruction fetch unit fetches instructions and supplies the execution units with a steady stream of instructions. If the stream is interrupted, the pipeline stalls. The stream of instructions may be interrupted by a cache miss or by a branch instruction.
Instruction hazards (contd..)
Consider a two-stage pipeline in which the first stage fetches instructions and the second executes them. Instructions I1, I2 and I3 are stored at successive memory locations; I2 is an unconditional branch instruction whose branch target is instruction Ik. Clock cycle 3: the fetch unit is fetching instruction I3, while the execute unit is decoding I2 and computing the branch target address. Clock cycle 4: the processor must discard I3, which has been incorrectly fetched, and fetch Ik; the execution unit is idle, and the pipeline stalls for one clock cycle.
Instruction hazards (contd..)
[Pipeline chart: I3 is fetched and discarded (X); the execution unit is idle in the cycle in which Ik is fetched.] The pipeline stalls for one clock cycle. The time lost as a result of a branch instruction is called the branch penalty; here the branch penalty is one clock cycle.
Instruction hazards (contd..)
The branch penalty depends on the length of the pipeline and may be higher for a longer pipeline. For a four-stage pipeline: the branch target address is computed in stage E2, so instructions I3 and I4 have to be discarded and the execution unit is idle for 2 clock cycles. The branch penalty is 2 clock cycles. [Pipeline chart: I3 and I4 discarded (X) before Ik and Ik+1 enter the pipeline.]
Instruction hazards (contd..)
The branch penalty can be reduced by computing the branch target address earlier in the pipeline. The instruction fetch unit has special hardware to identify a branch instruction after it is fetched, so the branch target address can be computed in the Decode stage (D2) rather than in the Execute stage (E2). The branch penalty is then only one clock cycle. [Pipeline chart: only I3 is discarded (X) before Ik enters the pipeline.]
499
Instruction hazards (contd..)
[Figure: instruction fetch unit with an instruction queue feeding a dispatch unit and the execution unit. F: Fetch instruction, D: Dispatch/Decode, E: Execute, W: Write results.]
The queue can hold several instructions. The fetch unit fetches instructions before they are needed and stores them in the queue. The dispatch unit takes instructions from the front of the queue and dispatches them to the execution unit; it also decodes the instruction.
Instruction hazards (contd..)
The fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions. If the pipeline stalls because of a data hazard, the dispatch unit cannot issue instructions from the queue, but the fetch unit continues to fetch instructions and add them to the queue. If there is a delay in fetching because of a cache miss or a branch, the dispatch unit continues to dispatch instructions from the instruction queue.
Instruction hazards (contd..)
Clock cycle:  1 2 3 4 5 6 7 8 9 10
Queue length: 1 1 1 1 2 3 2 1 1 1
[Figure: timing of I1 through I6 and Ik with the instruction queue; I1 takes three execute cycles, I5 is a branch, and I6 is discarded (X).]
The initial length of the queue is 1. A fetch adds 1 to the queue and a dispatch reduces the length by 1, so the queue length remains the same for the first 4 clock cycles. I1 stalls the pipeline for 2 cycles; the queue has space, so the fetch unit continues and the queue length rises to 3 in clock cycle 6. I5 is a branch instruction with target instruction Ik. Ik is fetched in cycle 7 and I6 is discarded. However, this does not stall the pipeline, since I4 is dispatched instead. I2, I3, I4 and Ik are executed in successive clock cycles. The fetch unit computes the branch address concurrently with the execution of other instructions. This is called branch folding.
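The interplay between fetching and dispatching can be illustrated with a small simulation (a hypothetical model, not the book's exact figure; `queue_trace` and its parameters are my own):

```python
# Hypothetical model: each cycle the fetch unit appends one instruction
# to the queue and the dispatch unit removes one, unless that side is
# stalled in the given cycle.

def queue_trace(cycles, fetch_stall=(), dispatch_stall=(), initial=1, capacity=8):
    """Return the queue length observed at the start of each cycle."""
    length, trace = initial, []
    for c in range(1, cycles + 1):
        trace.append(length)
        if c not in fetch_stall and length < capacity:
            length += 1                  # fetch unit adds to the queue
        if c not in dispatch_stall and length > 0:
            length -= 1                  # dispatch unit issues from the queue
    return trace

# Dispatch stalled in cycles 4 and 5 (e.g. a multi-cycle instruction
# occupies the execution unit): the queue absorbs the stall and grows.
print(queue_trace(6, dispatch_stall={4, 5}))  # → [1, 1, 1, 1, 2, 3]
```

The steady queue length in the early cycles, and its growth while dispatch is stalled, mirrors the behavior described above.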
Instruction hazards (contd..)
Branch folding can occur if there is at least one instruction in the queue other than the branch instruction, so the queue should ideally be full most of the time. This requires increasing the rate at which the fetch unit reads instructions from the cache; most processors allow more than one instruction to be fetched from the cache in one clock cycle. The fetch unit must also replenish the queue quickly after a branch has occurred. The instruction queue also mitigates the impact of cache misses: in the event of a cache miss, the dispatch unit continues to send instructions to the execution unit as long as the queue is not empty, while the desired cache block is read. If the queue does not become empty, the cache miss has no effect on the rate of instruction execution.
Conditional branches and branch prediction
Conditional branch instructions depend on the result of a preceding instruction, so the decision on whether to branch cannot be made until the execution of that instruction is complete. Branch instructions represent about 20% of the dynamic instruction count of most programs, where the dynamic instruction count takes into account that some instructions are executed repeatedly. Branch instructions may therefore incur branch penalties that reduce the performance gains expected from pipelining. Several techniques exist to mitigate the negative impact of the branch penalty on performance.
Delayed branch
The branch target address is computed in stage E2, so instructions I3 and I4 have to be discarded. The locations following a branch instruction are called branch delay slots. There may be more than one branch delay slot, depending on the time it takes to determine whether the instruction is a branch; in this case, there are two branch delay slots. The instructions in the delay slots are always fetched and at least partially executed before the branch decision is made and the branch address is computed.
[Figure: four-stage pipeline timing showing the two delay-slot instructions following the branch I2.]
Delayed branch (contd..)
Delayed branching can minimize the penalty incurred as a result of conditional branch instructions. Since the instructions in the delay slots are always fetched and partially executed, it is better to arrange for them to be fully executed whether or not the branch is taken. If we are able to place useful instructions in these slots, they will always do useful work whether or not the branch is taken. If we cannot place useful instructions in the branch delay slots, we fill them with NOP instructions.
Delayed branch (contd..)
Register R2 is used as a counter to determine how many times R1 is to be shifted. The processor has a two-stage pipeline, i.e. one delay slot.

(a) Original program loop:
LOOP    Shift_left   R1
        Decrement    R2
        Branch≠0     LOOP
NEXT    Add          R1,R3

(b) Reordered instructions:
LOOP    Decrement    R2
        Branch≠0     LOOP
        Shift_left   R1
NEXT    Add          R1,R3

The instructions can be reordered so that the shift-left instruction appears in the delay slot. The shift-left instruction is then always executed, whether the branch condition is true or false.
Delayed branch (contd..)
[Figure: pipelined execution (F, E) of the reordered loop: Decrement, Branch, Shift (delay slot), then Decrement again if the branch is taken, or Add if it is not.]
The shift instruction is executed whether the branch is taken or not.
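The effect of placing the shift in the delay slot can be emulated in Python (an illustrative sketch, not from the slides; I assume the branch loops back while R2 is nonzero):

```python
# Emulation of the reordered loop: Decrement; Branch (decision is made
# here); then the delay-slot Shift_left, which executes on every pass,
# including the final one -- preserving the original semantics.

def reordered_loop(r1, r2, r3):
    while True:
        r2 -= 1             # Decrement R2
        taken = r2 != 0     # branch decision: loop back while R2 != 0
        r1 <<= 1            # Shift_left R1 -- delay slot, always executed
        if not taken:
            break
    return r1 + r3          # NEXT: Add R1,R3

# R1 is shifted once per pass; R2 = 4 gives four passes:
print(reordered_loop(1, 4, 10))  # → 26, i.e. (1 << 4) + 10
```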
Delayed branch (contd..)
Logically, the program is executed as if the branch instruction were placed after the shift instruction. Branching takes place one instruction later than where the branch instruction appears in the instruction sequence (with reference to the reordered instructions); hence, this technique is termed "delayed branch". Delayed branch requires reordering as many instructions as there are delay slots. It is usually possible to reorganize one instruction to fill one delay slot, but difficult to reorganize two or more instructions to fill two or more delay slots.
Branch prediction
To reduce the branch penalty associated with conditional branches, we can predict whether the branch will be taken. The simplest form of branch prediction is to assume that the branch will not take place and continue to fetch instructions in sequential execution order. Until the branch condition is evaluated, instruction execution along the predicted path must be done on a speculative basis. "Speculative execution" means that the processor is executing instructions before it is certain that they are in the correct sequence, so processor registers and memory locations should not be updated until the sequence is confirmed. If the branch prediction turns out to be wrong, the instructions that were executed on a speculative basis and their data must be purged, and the correct sequence of instructions must be fetched and executed.
Branch prediction (contd..)
[Figure: four-stage pipeline timing for I1 (Compare), I2 (Branch>0), I3, I4 and Ik.]
I1 is a compare instruction and I2 is a branch instruction. Branch prediction takes place in cycle 3 while I2 is being decoded and I3 is being fetched. The fetch unit predicts that the branch will not be taken and continues to fetch I4 in cycle 4 while I3 is being decoded. The results of I1 are available in cycle 3, and the fetch unit evaluates the branch condition in cycle 4. If the branch prediction is incorrect, the fetch unit realizes it at this point: I3 and I4 are discarded and Ik is fetched from the branch target address.
Branch prediction (contd..)
If branch outcomes were random, then the simple approach of always assuming that the branch would not be taken would be correct 50% of the time. However, branch outcomes are not random and it may be possible to determine a priori whether a branch will be taken or not depending on the expected program behavior. For example, a branch instruction at the end of the loop causes a branch to the start of the loop for every pass through the loop except the last one. Better performance can be achieved if this branch is always predicted as taken. A branch instruction at the beginning of the loop causes the branch to be not taken most of the time. Better performance can be achieved if this branch is always predicted as not taken.
Branch prediction (contd..)
The decision on which way to predict the result of the branch instruction (taken or not taken) may be made in hardware, depending on whether the target address of the branch is lower or higher than the address of the branch instruction itself: if the target address is lower, the branch is predicted as taken; if it is higher, the branch is predicted as not taken. Branch prediction can also be handled by the compiler, which can set a branch prediction bit to 0 or 1 to indicate the desired behavior. The instruction fetch unit then checks the branch prediction bit to predict whether the branch will be taken.
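The hardware heuristic just described, often called "backward taken, forward not taken", is trivial to express (a sketch; the function name and example addresses are my own):

```python
# Predict taken exactly when the branch jumps backward, which is
# typical of loop-closing branches.

def predict_taken(branch_addr: int, target_addr: int) -> bool:
    return target_addr < branch_addr

print(predict_taken(0x2000, 0x1F00))  # → True  (backward branch, e.g. loop end)
print(predict_taken(0x2000, 0x2040))  # → False (forward branch)
```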
Branch prediction (contd..)
If the branch prediction decision is the same every time an instruction is executed, this is "static branch prediction". If the branch prediction decision may change depending on the execution history, this is "dynamic branch prediction".
Branch prediction (contd..)
Branch prediction algorithms should minimize the probability of making a wrong branch prediction decision. In dynamic branch prediction the processor hardware assesses the likelihood of a given branch being taken by keeping track of branch decisions every time that instruction is executed. Simplest form of execution history used in predicting the outcome of a given branch instruction is the result of the most recent execution of that instruction. Processor assumes that the next time the instruction is executed, the result is likely to be the same. For example, if the branch was taken the last time the instruction was executed, then the branch is likely to be taken this time as well.
Branch prediction (contd..)
The branch prediction algorithm may be described as a state machine with two states:
LT: branch is likely to be taken
LNT: branch is likely not to be taken
Let the initial state of the machine be LNT. When the branch instruction is executed, if the branch is taken, the machine moves to state LT; if the branch is not taken, it remains in state LNT. When the same branch instruction is executed the next time, the branch is predicted as taken if the state of the machine is LT, and as not taken otherwise.
[State diagram: LNT and LT, with transitions BT (branch taken) and BNT (branch not taken).]
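The two-state machine can be sketched directly (an illustrative implementation; the class and method names are my own):

```python
# One-bit dynamic predictor: remember only the most recent outcome.

class OneBitPredictor:
    def __init__(self):
        self.state = "LNT"                 # initial state: likely not taken

    def predict(self) -> bool:
        return self.state == "LT"          # predict taken only in state LT

    def update(self, taken: bool):
        self.state = "LT" if taken else "LNT"

# A loop branch taken three times, then not taken on loop exit:
p = OneBitPredictor()
correct = 0
for taken in [True, True, True, False]:
    correct += (p.predict() == taken)
    p.update(taken)
print(correct)  # → 2: the first pass and the last pass are mispredicted
```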
Branch prediction (contd..)
This scheme requires only one bit of history information for each branch instruction. It works well inside loops: once a loop is entered, the branch instruction that controls the looping will always yield the same result until the last pass. In the last pass, the branch prediction turns out to be incorrect, and the branch history state machine changes to the opposite state. Consequently, if the same loop is entered the next time and there is more than one pass, the machine also mispredicts the first pass, giving two mispredictions per execution of the loop. Better performance may be achieved by keeping more execution history.
Branch prediction (contd..)
ST: strongly likely to be taken; LT: likely to be taken; LNT: likely not to be taken; SNT: strongly likely not to be taken.
[State diagram: SNT, LNT, LT and ST, with transitions BT (branch taken) and BNT (branch not taken).]
The initial state of the algorithm is LNT. After the branch instruction is executed, if the branch is taken, the state is changed to ST. For a branch instruction, the fetch unit predicts that the branch will be taken if the state is ST or LT; otherwise it predicts that the branch will not be taken. In state SNT:
- The prediction is that the branch is not taken.
- If the branch is actually taken, the state changes to LNT.
- Next time the branch is encountered, the prediction again is that it is not taken.
- If the prediction is wrong the second time, the state changes to ST.
- After that, the branch is predicted as taken.
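The four-state algorithm can be sketched as a transition table (an illustrative implementation; transitions not spelled out in the text are filled in from the standard Hamacher-style diagram, and the names are my own):

```python
# Four-state dynamic predictor; predict taken in states ST and LT.

TRANSITIONS = {
    ("ST",  True): "ST",  ("ST",  False): "LT",
    ("LT",  True): "ST",  ("LT",  False): "SNT",
    ("LNT", True): "ST",  ("LNT", False): "SNT",
    ("SNT", True): "LNT", ("SNT", False): "SNT",
}

class TwoBitPredictor:
    def __init__(self, state="LNT"):
        self.state = state

    def predict(self) -> bool:
        return self.state in ("ST", "LT")

    def update(self, taken: bool):
        self.state = TRANSITIONS[(self.state, taken)]

# A 4-pass loop entered twice (taken, taken, taken, not taken -- twice):
p = TwoBitPredictor()
correct = 0
for taken in [True, True, True, False] * 2:
    correct += (p.predict() == taken)
    p.update(taken)
print(correct)  # → 5 of 8: only the very first pass and each loop exit mispredict
```

Note how the second loop entry starts in state LT, so its first pass is predicted correctly, unlike the one-bit scheme.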
Branch prediction (contd..)
Consider a loop with a branch instruction at the end, and let the initial state of the branch prediction algorithm be LNT. In the first pass, the algorithm will predict that the branch is not taken; this prediction will be incorrect, and the state will change to ST. In the subsequent passes, the algorithm will predict that the branch is taken; this prediction will be correct, except for the last pass. In the last pass the branch is not taken, and the state will change from ST to LT. When the loop is entered the second time, the algorithm will predict that the branch is taken, and this prediction will be correct.
Branch prediction (contd..)
The branch prediction algorithm mispredicts the outcome of the branch instruction in the first pass, and the prediction in the first pass depends on the initial state of the algorithm. If the initial state can be set correctly, the misprediction in the first pass can be avoided. The information necessary to set the initial state can be provided by the static prediction schemes described earlier: comparing the branch target address with the address of the branch instruction, or checking the branch prediction bit set by the compiler. For a branch instruction at the end of a loop, the initial state is set to LT; for a branch instruction at the start of a loop, it is set to LNT. With this, the only misprediction that occurs is on the final pass through the loop, and this misprediction is unavoidable.
Superscalar operation
Pipelining enables multiple instructions to be executed concurrently by dividing the execution of an instruction into several stages: Instructions enter the pipeline in strict program order. If the pipeline does not stall, one instruction enters the pipeline and one instruction completes execution in one clock cycle. Maximum throughput of a pipelined processor is one instruction per clock cycle. An alternative approach is to equip the processor with multiple processing units to handle several instructions in parallel in each stage.
Superscalar operation (contd..)
If a processor has multiple processing units then several instructions can start execution in the same clock cycle. Processor is said to use “multiple issue”. These processors are capable of achieving instruction execution throughput of more than one instruction per cycle. These processors are known as “superscalar processors”.
Superscalar operation (contd..)
[Figure: superscalar processor with an instruction fetch unit (F), an instruction queue, a dispatch unit, integer and floating-point execution units, and a write-results unit (W).]
The instruction fetch unit is capable of reading two instructions at a time and storing them in the instruction queue. The dispatch unit retrieves up to two instructions at a time from the front of the queue. The processor has two execution units: integer and floating point. If there is one integer and one floating-point instruction, and no hazards, then both instructions are dispatched in the same clock cycle.
Superscalar operation (contd..)
Various hazards cause an even greater deterioration in performance in the case of a superscalar processor. The compiler can avoid many hazards by careful ordering of instructions: for example, it should try to interleave floating-point and integer instructions, so that the dispatch unit can dispatch two instructions in most clock cycles and keep both the integer and floating-point units busy most of the time. If the compiler can order instructions so that the available hardware units are kept busy most of the time, high performance can be achieved.
Superscalar operation (contd..)
[Figure: timing of superscalar execution of I1 (Fadd), I2 (Add), I3 (Fsub) and I4 (Sub) over seven clock cycles.]
Instructions in the floating-point unit take three cycles to execute; the floating-point unit is organized as a three-stage pipeline. Instructions in the integer unit take one cycle to execute; the integer unit is organized as a single-stage pipeline.
Clock cycle 1:
- Instructions I1 (floating point) and I2 (integer) are fetched.
Clock cycle 2:
- Instructions I1 and I2 are decoded and dispatched, I3 is fetched.
Superscalar operation (contd..)
[Figure: same timing diagram as on the previous slide.]
Clock cycle 3:
- I1 and I2 begin execution; I2 completes execution. I3 is dispatched to the floating-point unit and I4 is dispatched to the integer unit.
Clock cycle 4:
- I1 continues execution, I3 begins execution, I2 completes the Write stage, I4 completes execution.
Clock cycle 5:
- I1 completes execution, I3 continues execution, and I4 completes its Write.
The order of completion is I2, I4, I1, I3.
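The completion order above can be reproduced with a small timing sketch (a hypothetical model of this example, with my own function name; both units are pipelined, so each accepts a new instruction every cycle):

```python
# Compute the cycle in which each instruction finishes execution:
# two instructions are fetched per cycle, dispatched the next cycle,
# and execution starts the cycle after dispatch.

def completion_order(instrs):
    """instrs: (name, unit) pairs in program order; returns names
    sorted by the cycle in which execution finishes."""
    exec_cycles = {"int": 1, "fp": 3}
    finished = []
    for i, (name, unit) in enumerate(instrs):
        fetch = 1 + i // 2                    # two fetched per cycle
        start = fetch + 2                     # dispatch, then execute
        finish = start + exec_cycles[unit] - 1
        finished.append((finish, i, name))
    return [name for _, _, name in sorted(finished)]

print(completion_order([("I1", "fp"), ("I2", "int"),
                        ("I3", "fp"), ("I4", "int")]))
# → ['I2', 'I4', 'I1', 'I3'], matching the order in the example
```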
Out-of-order execution
Instructions are dispatched in the same order as they appear in the program; however, they complete execution out of order. Dependencies among instructions must be handled correctly so that this does not lead to any problems. But what if an exception occurs during the execution of an instruction after one or more of the succeeding instructions have been executed to completion? For example, the execution of instruction I1 may cause an exception after instruction I2 has completed execution and written its results to the destination location. If a processor permits succeeding instructions to complete execution and write to the destination locations before knowing whether the prior instructions cause exceptions, it is said to allow "imprecise exceptions".
Out-of-order execution (contd..)
[Figure: timing in which the integer unit holds the result of I2 so that writes occur in program order, with W2 delayed to cycle 6.]
To guarantee a consistent state when exceptions occur, the results of execution must be written to the destination locations strictly in program order. Step W2 must be delayed until cycle 6, when I1 enters the write stage; the integer unit must retain the result of I2 until cycle 6 and cannot accept another instruction until then. If an exception occurs during an instruction, then all subsequent instructions that may have been partially executed are discarded. This is known as a "precise exception".
Execution completion
It is beneficial to allow out-of-order execution, so that an execution unit is freed to execute other instructions. However, instructions must be completed in program order to allow precise exceptions. These requirements are conflicting. The conflict can be resolved by allowing execution to proceed and writing the results into temporary registers; the contents of the temporary registers are then transferred to the permanent registers in correct program order.
Execution completion (contd..)
[Figure: timing with a temporary-write step TW2 in cycle 4 and the final write W2 in cycle 6.]
Step TW is a write into a temporary register. Step W is the final step, in which the contents of the temporary register are transferred into the appropriate permanent register. Step W is called the "commitment step" because the effect of the instruction execution cannot be reversed after that point. Before the commitment step, if any instruction causes an exception, the results of the succeeding instructions that are still in temporary registers can be safely discarded.
Execution completion (contd..)
[Figure: same timing diagram as on the previous slide.]
The temporary register is given the same name and is treated in the same way as the permanent register whose data it is holding. For example, if the destination register of I2 is R5, then the temporary register used in step TW2 is treated as R5 in clock cycles 6 and 7. This technique is called "register renaming". If any succeeding instruction refers to R5 during clock cycles 6 and 7, the contents of the temporary register are forwarded to it.
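Register renaming with forwarding can be sketched as a pair of maps (an illustrative model, not a real processor structure; the class and method names are my own):

```python
# Results are first written to a "temporary" version of the destination
# register; readers see the newest value; commit moves it to the
# permanent register file in program order.

class RenameFile:
    def __init__(self):
        self.permanent = {}
        self.temp = {}                 # architectural name -> uncommitted value

    def write_temp(self, reg, value):  # step TW: write a temporary register
        self.temp[reg] = value

    def read(self, reg):               # forward the uncommitted value if any
        return self.temp.get(reg, self.permanent.get(reg))

    def commit(self, reg):             # step W: temporary -> permanent
        self.permanent[reg] = self.temp.pop(reg)

rf = RenameFile()
rf.write_temp("R5", 42)    # I2 has produced R5 but has not committed yet
print(rf.read("R5"))       # → 42: a succeeding instruction is forwarded the value
rf.commit("R5")
print(rf.permanent["R5"])  # → 42: now architecturally visible
```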
Execution completion (contd..)
A special control unit called “commitment unit” is needed to ensure in-order commitment when out-of-order execution is allowed. Commitment unit has a queue called “reorder buffer” to determine which instructions should be committed next: Instructions are entered in the queue strictly in the program order as they are dispatched for execution. When an instruction reaches the head of the queue and its execution has been completed: Results are transferred from temporary registers to permanent registers. All resources assigned to this instruction are released. The instruction is said to have “retired”. Instructions are retired strictly in program order, though they may be completed out-of-order.
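The reorder buffer's in-order retirement can be sketched as follows (an illustrative model; the class and method names are my own):

```python
from collections import deque

# Instructions enter in program order; they retire from the head only
# once complete, even if later instructions finished execution earlier.

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()

    def dispatch(self, name):
        self.entries.append({"name": name, "done": False})

    def complete(self, name):          # execution finished, possibly out of order
        for e in self.entries:
            if e["name"] == name:
                e["done"] = True

    def retire(self):
        """Retire completed instructions from the head, in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired

rob = ReorderBuffer()
for name in ["I1", "I2", "I3"]:
    rob.dispatch(name)
rob.complete("I2")
print(rob.retire())   # → []: I2 is done, but I1 at the head is not
rob.complete("I1")
print(rob.retire())   # → ['I1', 'I2']: retired strictly in program order
```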