Obvious design goal:
◦ Construct an implementation with the desired functionality
Key design challenge:
◦ Simultaneously optimize numerous design metrics
Design metric:
◦ A measurable feature of a system’s implementation
◦ Optimizing design metrics is a key challenge
Common metrics:
◦ Unit cost: the monetary cost of manufacturing each copy of the system, excluding NRE cost
◦ NRE cost (Non-Recurring Engineering cost): the one-time monetary cost of designing the system
◦ Size: the physical space required by the system
◦ There are several others, such as reliability, ease of use, and energy requirements
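The relationship between unit cost and NRE cost can be made concrete with a small calculation. The sketch below assumes the usual amortization model (total cost = NRE cost + unit cost × number of units). The function names and dollar figures are illustrative and do not come from the slides.

```python
def total_cost(nre_cost, unit_cost, units):
    """One-time design cost plus the manufacturing cost of every copy."""
    return nre_cost + unit_cost * units

def per_product_cost(nre_cost, unit_cost, units):
    """Total cost spread evenly over the units produced."""
    return total_cost(nre_cost, unit_cost, units) / units

# Hypothetical figures: $2000 NRE, $100 per unit.
for units in (10, 1000):
    print(units, "units ->", per_product_cost(2000, 100, units), "per product")
```

With few units the NRE cost dominates the per-product cost; with many units the per-product cost approaches the unit cost.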
Expertise with both software and hardware is needed to optimize design metrics
◦ Not just a hardware or software expert
◦ A designer must be comfortable with various technologies in order to choose the best for a given application and constraints
[Figure: competing design metrics (size, performance, power, NRE cost)]
[Figure: digital camera chip with lens, CCD, A2D, CCD preprocessor, pixel coprocessor, D2A, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, memory controller, ISA bus interface, UART, LCD ctrl and display ctrl; labelled "Hardware" and "Software"]
A clock is a circuit that emits a series of pulses with a precise pulse width and a precise interval between consecutive pulses
◦ The interval between the corresponding edges of two consecutive pulses is known as the clock cycle time
A key factor in determining clock speed is the amount of work that must be done in each clock cycle
◦ The more work, the longer the cycle
◦ The sequence of operations that must be performed serially in a single clock cycle determines the length of the cycle, even though other operations may be proceeding in parallel
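One way to picture this is to treat a cycle as several chains of operations running in parallel, where each chain is a serial sequence. The minimal sketch below assumes the cycle must be at least as long as the slowest chain. The delay values and the function name are hypothetical.

```python
def min_cycle_time_ns(parallel_paths):
    """Each path is a list of delays (ns) for operations done one after another.

    Paths run in parallel, so the cycle is bounded by the longest serial path.
    """
    return max(sum(path) for path in parallel_paths)

# Hypothetical delays: one three-step serial chain and one shorter chain in parallel.
paths_ns = [[1.0, 3.0, 1.0], [0.5, 0.5]]
cycle_ns = min_cycle_time_ns(paths_ns)
print("cycle time:", cycle_ns, "ns; max clock:", 1000.0 / cycle_ns, "MHz")
```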
Two main methods of gaining speed:
1. Hardware: speed through new technology
2. Organization (given a technology & ISA): three basic approaches for speeding up execution
   1. Reduce the number of clock cycles needed to execute an instruction
      ◦ Reduce the number of micro-instructions, i.e. the path length, for an ISA instruction
   2. Simplify the organization so that the clock cycle can be shorter
      ◦ Adding hardware (does not help as much as expected)
      ◦ Breaking the data path into stages
   3. Overlap the execution of instructions
      ◦ Separating the circuitry for fetching instructions (8-bit memory port, MBR and PC) can be effective
      ◦ Pipelining
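The three organizational approaches can be read as attacking different factors of the usual execution-time product (instruction count × cycles per instruction × cycle time). The sketch below is only an illustration of that product. All numbers are hypothetical, and overlap is modelled simply as a lower effective number of cycles per instruction.

```python
def execution_time_ns(instructions, cycles_per_instruction, cycle_time_ns):
    """Total time = instruction count x average cycles per instruction x cycle time."""
    return instructions * cycles_per_instruction * cycle_time_ns

baseline = execution_time_ns(1000, 4, 10)       # hypothetical starting point
fewer_cycles = execution_time_ns(1000, 3, 10)   # approach 1: shorter path length
shorter_clock = execution_time_ns(1000, 4, 8)   # approach 2: shorter clock cycle
overlapped = execution_time_ns(1000, 2, 10)     # approach 3: overlap lowers effective cycles

for name, t in [("baseline", baseline), ("fewer cycles", fewer_cycles),
                ("shorter clock", shorter_clock), ("overlapped", overlapped)]:
    print(f"{name}: {t} ns")
```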
How can cost be measured for circuits?
Measured in a variety of ways:
◦ Count the number of components
◦ When the entire processor exists on a single chip, chip size matters: bigger, more complex chips are much more expensive than smaller, simpler ones
◦ Technology used, and whether components are custom made or COTS (commercial off the shelf)
◦ The more area required for the functions, the larger the chip; designers use the term “real estate” for the area required by a circuit
Speeding up the circuit with fast components costs money - $$$$$
◦ A trade-off similar to memory hierarchies
◦ Use a small number of fast parts: those that we determine will be used the most frequently
One can control the amount of decoding
◦ Any of the nine registers can be read into the ALU from the B bus
◦ With a decoder, only 4 bits in the microinstruction are required to specify which register is selected, instead of 9 (one per register)
◦ Decoding adds delay
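The decoding step can be sketched as expanding a 4-bit field into nine enable lines, only one of which is active. The register names and their ordering below follow the usual Mic-1 description and are an assumption here, since the slide does not list them.

```python
# Hypothetical ordering of the nine B-bus sources (assumed, not from the slide).
B_BUS_SOURCES = ["MDR", "PC", "MBR", "MBRU", "SP", "LV", "CPP", "TOS", "OPC"]

def decode_b_field(field):
    """Expand a 4-bit encoded field into one enable line per B-bus source."""
    if not 0 <= field < len(B_BUS_SOURCES):
        raise ValueError(f"field {field} does not select a B-bus source")
    return [1 if i == field else 0 for i in range(len(B_BUS_SOURCES))]

encoded_bits = 4                            # bits stored in each microinstruction
unencoded_bits = len(B_BUS_SOURCES)         # one bit per source without a decoder
bits_saved = unencoded_bits - encoded_bits  # control-store bits saved per microinstruction

print(decode_b_field(4), "bits saved:", bits_saved)
```

The saving in control-store width is paid for in time: the enable lines are valid only after the decoder has settled.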
Delays
◦ The ALU receives its input slightly delayed
◦ The result is available on the C bus a little later
The clock cannot run quite as fast due to the delays
◦ Reducing the control store by 5 bits per microinstruction comes at the cost of reduced clock speed
Best Quote:
◦ “Simple machines are not fast & fast machines are not simple.”
A look at our architecture (Mic-1 CPU), which uses the minimum amount of hardware:
◦ 10 registers
◦ Simple ALU (1-bit ALU replicated 32 times)
◦ Shifter
◦ Decoder
◦ Control store
◦ Some glue
Let’s look at ways to reduce the number of micro-instructions per ISA instruction
◦ Recall: each ISA instruction is represented as several micro-code instructions
One way is to reduce the path length by merging the Interpreter Loop with Microcode
The main loop must be executed at the beginning of every IJVM instruction
◦ It is possible to overlap it with a previous instruction
Original sequence (four cycles):

Label     Operations                          Comment
pop1      MAR = SP = SP - 1; rd               Read in the next-to-top word on stack
pop2                                          Wait for the new TOS to be read from memory
pop3      TOS = MDR; go to Main1              Copy new word to TOS
Main1     PC = PC + 1; fetch; go to (MBR)     MBR holds opcode; get next byte; dispatch

The sequence above can be reduced to three instructions by merging the main-loop instruction into it (three cycles):

Label     Operations                          Comment
pop1      MAR = SP = SP - 1; rd               Read in the next-to-top word on stack
Main.pop  PC = PC + 1; fetch                  Wait for the new TOS to be read from memory
pop3      TOS = MDR; go to (MBR)              Copy new word to TOS
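The merge works because the pop2 cycle does nothing but wait for memory, so the main-loop work can be done there instead. The toy simulation below is a minimal sketch of that idea, not a full Mic-1 model. It assumes a read started in cycle k delivers its word to MDR in time for cycle k+2, and it models the fetch as a PC increment only. The state values and function names are hypothetical.

```python
def make_state():
    # Hypothetical state: word 11 is the top of stack, word 10 lies below it.
    return {"SP": 11, "TOS": 222, "MDR": 0, "MAR": 0, "PC": 100,
            "mem": {10: 111, 11: 222}}

def pop1(s):            # MAR = SP = SP - 1; rd
    s["SP"] -= 1
    s["MAR"] = s["SP"]
    return True         # True means a memory read was started

def wait(s):            # pop2: idle cycle while memory responds
    return False

def fetch_next(s):      # PC = PC + 1; fetch (the byte fetch itself is not modelled)
    s["PC"] += 1
    return False

def pop3(s):            # TOS = MDR
    s["TOS"] = s["MDR"]
    return False

def run(microprogram, state):
    """Run one micro-instruction per cycle and return the cycle count."""
    pending = None      # (cycle in which the data becomes usable, address)
    for cycle, (_label, action) in enumerate(microprogram):
        if pending is not None and pending[0] == cycle:
            state["MDR"] = state["mem"][pending[1]]
            pending = None
        if action(state):
            pending = (cycle + 2, state["MAR"])
    return len(microprogram)

original = [("pop1", pop1), ("pop2", wait), ("pop3", pop3), ("Main1", fetch_next)]
merged = [("pop1", pop1), ("Main.pop", fetch_next), ("pop3", pop3)]

s_original, s_merged = make_state(), make_state()
print(run(original, s_original), run(merged, s_merged), s_original == s_merged)
```

Both sequences leave the machine in the same state; the merged one simply uses the otherwise idle cycle.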
Look at the architecture shown below:
Let’s simulate the IADD ISA instruction
◦ Is there a path that could be sped up by adding something?
Add another bus!!! (the A bus)
◦ No longer need an instruction simply to load the H register
◦ Possible to add any register to any register in one cycle
Using a 3-bus architecture: how can the following sequence of micro-instructions for ILOAD be reduced?
The result: adding an additional bus has reduced the total execution time of ILOAD from six to five cycles
What are the apparent trade-offs here?
Cardinal Rule of Computer Design: Make the common case fast
What is common about almost all instructions? For every instruction the following may occur:
1. The PC is passed through the ALU and incremented
2. The PC is used to fetch the next byte in the instruction stream
3. Operands are read from memory
4. Operands are written to memory
5. The ALU does a computation & results are stored back
How can we improve this?
◦ Create an independent unit to fetch and process the instructions: the Instruction Fetch Unit (IFU)
Reduce the ALU load
◦ Requires an incrementer, which is far simpler than an adder or another ALU
◦ Can independently increment PC and fetch bytes from the byte stream before they are needed
◦ Without an IFU, if an instruction has an operand, it must be explicitly fetched one byte at a time
◦ Not having to increment PC in the main loop helps, since incrementing PC is generally all the main loop does
Tradeoffs?
Two approaches:
1. Interpret each opcode, determine the number of additional fields (operands), then fetch and assemble them
2. Take advantage of the stream nature of the instructions and make the next 8- and 16-bit pieces available at all times for immediate use
We will discuss the second approach.
There are now two MBRs
◦ 8-bit MBR1 and 16-bit MBR2
The IFU keeps track of the most recent byte(s) consumed by the main execution unit
◦ When MBR1 is read, the next values are shifted into MBR1 & MBR2
◦ MBR1 holds the oldest byte in the shift register, while MBR2 holds the oldest 2 bytes (as a 16-bit integer)
◦ This makes the next 8- and 16-bit pieces available at all times, so instructions can use whichever they need
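The behaviour of the two registers can be sketched in software as a cursor over the instruction byte stream. This is a simplified model under stated assumptions: it ignores prefetch timing and sign extension, and it assembles MBR2 with the first byte as the high-order byte. The class name, method names and byte values are hypothetical.

```python
class InstructionFetchUnit:
    """Toy model: the oldest unconsumed bytes are always ready to read."""

    def __init__(self, code):
        self.code = bytes(code)
        self.pos = 0              # index of the oldest unconsumed byte

    def read_mbr1(self):
        """Consume one byte (an opcode or an 8-bit operand)."""
        value = self.code[self.pos]
        self.pos += 1
        return value

    def read_mbr2(self):
        """Consume two bytes as one 16-bit value (first byte assumed high-order)."""
        high, low = self.code[self.pos], self.code[self.pos + 1]
        self.pos += 2
        return (high << 8) | low

# Illustrative stream: opcode + 1-byte operand, then opcode + 2-byte operand.
ifu = InstructionFetchUnit([0x10, 0x05, 0xA7, 0x01, 0x02])
print(hex(ifu.read_mbr1()), ifu.read_mbr1(), hex(ifu.read_mbr1()), hex(ifu.read_mbr2()))
```

The execution unit never increments a PC itself here; each read implicitly advances the stream.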
Benefits:
1. Eliminates the main loop entirely; each instruction branches directly to the next instruction
2. Avoids tying up the ALU incrementing the PC
3. Treats instructions as streams, taking advantage of the stream nature of instructions
NOTE: Bytes are opcodes & operands; not all instructions use operands
What about pipelining?
◦ Attempt to make the clock cycle shorter by introducing more parallelism
Clock cycle
◦ Recall that the clock cycle is limited by the time needed for the signals to propagate through the data path
There are three major components to the actual data path cycle:
1. The time to drive the selected registers onto the A and B buses (Registers + A and B buses)
2. The time for the ALU and shifter to do their work (ALU/shifter)
3. The time for the results to get back to the registers to be stored (C bus)
Adding parallelism is a real opportunity
Steps
◦ Washing machine (30 minutes)
◦ Dryer (30 minutes)
◦ Folding (30 minutes)
◦ Putting away (30 minutes)
Each step is part of doing one load of laundry
◦ How did we pipeline them?
◦ How did we know how to pipeline them?
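The payoff of pipelining the laundry can be worked out directly. The sketch below assumes four equal 30-minute steps, and that a new load can enter the first step as soon as the previous load leaves it. The function names are hypothetical.

```python
def sequential_minutes(loads, stages=4, stage_minutes=30):
    """Each load finishes all steps before the next load starts."""
    return loads * stages * stage_minutes

def pipelined_minutes(loads, stages=4, stage_minutes=30):
    """A new load enters every stage_minutes once the first step is free."""
    if loads == 0:
        return 0
    return (stages + loads - 1) * stage_minutes

for loads in (1, 4):
    print(loads, "loads:", sequential_minutes(loads), "min sequential,",
          pipelined_minutes(loads), "min pipelined")
```

A single load takes just as long either way; the gain is in throughput once several loads are in flight.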
Our data path can also be broken into logical steps:
1. Registers + A & B buses
2. ALU/Shifter
3. C bus
We separate each portion by using latches: flip-flops (registers)
◦ One inserted in the middle of each bus
Why do this? What have we gained?
◦ We can speed up the clock because the maximum delay is now shorter
◦ We can use all parts of the data path during every cycle
Now it takes three clock cycles to use the data path
◦ One for loading the A and B latches
◦ One for running the ALU and shifter and loading the C latch
◦ One for storing the C latch back into the registers
Are we worse off now?
First point…
◦ Now we have three smaller data paths with reduced maximum delays, so the clock frequency can be higher
◦ By breaking up the data path into three time intervals (each about 1/3 as long), the clock speed can in principle be tripled
◦ Not quite true, since the additional registers (latches) that have been added contribute delay of their own
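The "not quite triple" caveat can be put into numbers. The sketch below assumes the pipelined cycle must cover the slowest stage plus a fixed latch overhead. The delay values and the function name are hypothetical.

```python
def pipelined_cycle_ns(stage_delays_ns, latch_overhead_ns):
    """The clock must cover the slowest stage plus the latch delay."""
    return max(stage_delays_ns) + latch_overhead_ns

stage_delays_ns = [3.0, 3.0, 3.0]         # hypothetical: three equal thirds
unpipelined_ns = sum(stage_delays_ns)     # whole data path in one cycle
pipelined_ns = pipelined_cycle_ns(stage_delays_ns, latch_overhead_ns=0.5)
clock_speedup = unpipelined_ns / pipelined_ns

print(f"cycle: {unpipelined_ns} ns -> {pipelined_ns} ns, clock speedup {clock_speedup:.2f}x")
```

Unequal stage delays would reduce the gain further, because the slowest stage sets the clock.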
Second point…
◦ What matters is throughput, rather than the speed of an individual instruction
◦ Before: 1 micro-instruction = 1 data path cycle
◦ Now: 1 micro-instruction = 1 data path cycle divided into 3 steps
For example, look at swap1:
◦ Before: MAR = SP - 1; rd
◦ Now, as a sequence of micro-steps:
  B = SP
  C = B - 1
  MAR = C; rd
  MDR = Mem
We try to issue a new micro-instruction on every cycle, so that, for example, the ALU can be used on every cycle
Pipelined implementation of swap
◦ Notice that Swap3 depends on the result of Swap1
◦ This is called a read-after-write (RAW) dependence, or true dependence
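RAW dependences of this kind can be found mechanically by tracking which step last wrote each register. The sketch below applies that idea to the micro-steps given for swap1 in the earlier example. The data layout and function name are hypothetical, and this is a simple illustration rather than how the hardware detects hazards.

```python
def raw_dependences(steps):
    """Return (producer index, consumer index, register) for each read-after-write."""
    deps = []
    last_writer = {}                      # register -> index of most recent writer
    for i, (_text, writes, reads) in enumerate(steps):
        for reg in sorted(reads):
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        for reg in writes:
            last_writer[reg] = i
    return deps

# Micro-steps of swap1: (text, registers written, registers read).
swap1_steps = [
    ("B = SP",      {"B"},   {"SP"}),
    ("C = B - 1",   {"C"},   {"B"}),
    ("MAR = C; rd", {"MAR"}, {"C"}),
    ("MDR = Mem",   {"MDR"}, {"MAR"}),
]

for producer, consumer, reg in raw_dependences(swap1_steps):
    print(f"'{swap1_steps[consumer][0]}' must wait for '{swap1_steps[producer][0]}' ({reg})")
```

A step that reads a register cannot start until the step that writes it has produced the value, which is what forces stalls in the pipeline.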
4-stage pipeline:
◦ Stage 1: Instruction fetching
◦ Stage 2: Operand access
◦ Stage 3: ALU operations
◦ Stage 4: Writeback to registers
NOTE: although the Mic-3 program takes more cycles than the Mic-2 program, it still runs faster, because each cycle is shorter
For the instructions shown below, let’s determine the new micro-instruction sequence if we were to merge the Main1 instruction into each micro-instruction that ends with: goto Main1
What are the trade-offs for doing this?