An Example Implementation


1 An Example Implementation
In principle, we could describe the control store in binary, 36 bits per word. Instead, we will use a simple symbolic language to make it easier to understand. The language describes what happens at each clock cycle, rather than being a higher-level language.
To copy something from one register to another, we use an assignment statement:
MDR = SP
To indicate an operation more complicated than simply passing a value through the B bus:
MDR = H + SP
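As a rough illustration (register names follow the slides, but the values and the dictionary model are made up; Python assignment stands in for a one-cycle transfer), the two statements above behave like:

```python
# A minimal sketch, not real hardware: the register-transfer statements
# from the slide modeled with a plain dictionary of register values.
regs = {"H": 8, "SP": 100, "MDR": 0}

# MDR = SP : pass SP through the B bus into MDR
regs["MDR"] = regs["SP"]

# MDR = H + SP : the ALU adds H (its left input) to SP (from the B bus)
regs["MDR"] = regs["H"] + regs["SP"]
```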

2 An Example Implementation
We must use only legal combinations in the assignment statement (e.g., no H = H - MDR). We can assign the result to multiple registers, so we can also write:
SP = MDR = SP + 1
To indicate memory reads and writes of 4-byte data words, we just put rd and wr in the microinstruction. Fetching a byte through the 1-byte port is indicated by fetch. Assignments and memory operations can occur in the same cycle; this is indicated by writing them on the same line.

3 An Example Implementation

4 An Example Implementation
Remember that each microinstruction must explicitly supply the address of the next microinstruction. To ease the programmer’s job, the microassembler normally assigns an address to each microinstruction (not necessarily consecutive in the control store) and fills in the NEXT_ADDRESS field so that microinstructions written on consecutive lines are executed consecutively. Sometimes, however, the programmer wants to branch away unconditionally: goto label.

5 An Example Implementation
To do a conditional branch, we can use the Z and N flip-flops. For example, Z = TOS runs TOS through the ALU and sets Z if it is zero. We can then write:
Z = TOS; if (Z) goto L1; else goto L2
Note that the two targets must differ by exactly 256, with L1 (the target when Z is set) at the higher address: the previous instruction causes Z to be ORed into the high-order bit of MPC. To use the JMPC bit, we write goto (MBR OR value). This tells the microassembler to use value for NEXT_ADDRESS and to set the JMPC bit, so that MBR is ORed with NEXT_ADDRESS to form MPC.
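The address formation can be sketched as follows (the function name and the 0x92 example address are illustrative; the mechanism, ORing Z into the high-order bit of the 9-bit address, is the one described above):

```python
# Sketch of how the next microinstruction address (MPC) is formed when the
# JAMZ bit is set: the Z flip-flop is ORed into the high-order bit of the
# 9-bit NEXT_ADDRESS, so the two possible successors differ by exactly 256.
def next_mpc(next_address, z, jamz=True):
    mpc = next_address & 0x1FF          # 9-bit NEXT_ADDRESS field
    if jamz and z:
        mpc |= 0x100                    # OR Z into the high-order bit
    return mpc
```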

6 An Example Implementation
If value is 0, which is the normal case, we can just write: goto (MBR). The actual microprogram that interprets IJVM is 112 microinstructions long. Consecutive microinstructions are not necessarily located at consecutive addresses in the control store. Note the meanings of the registers:
CPP is a pointer to the constant pool.
LV is a pointer to the local variables.
SP is a pointer to the top of the stack.
PC holds the address of the next instruction.

7 An Example Implementation
MBR is a 1-byte register that holds the bytes of the instruction stream as they come in to be executed. At the beginning and end of each instruction, TOS contains the value of the memory word pointed to by SP, the top word on the stack. For some instructions, POP for example, more work is necessary to keep it up to date. The OPC register is a temporary register. The main loop of the interpreter begins on the line labeled MAIN1 and is a single microinstruction.

8 An Example Implementation
Control store addresses corresponding to opcodes must be reserved for the first word of the corresponding instruction interpreter. Assume that MBR contains 0x60 (IADD). The main loop must:
Increment the PC, leaving it containing the address of the first byte after the opcode.
Initiate a fetch of the next byte into MBR.
Perform a multiway branch to the address contained in MBR at the start of MAIN1. This address is the opcode currently being executed.
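The three jobs of MAIN1 can be sketched like this (the state dictionary, the instantaneous fetch, and the example byte values are simplifications of ours; a real fetch takes a cycle to complete):

```python
# Sketch of MAIN1: increment PC, start fetching the next opcode byte,
# and multiway-branch on the opcode already sitting in MBR.
def main1(state, memory):
    opcode = state["MBR"]               # opcode fetched by the previous instruction
    state["PC"] += 1                    # 1. PC now points just past the opcode
    state["MBR"] = memory[state["PC"]]  # 2. fetch the next byte into MBR
    return opcode                       # 3. multiway branch: next microaddress = opcode

state = {"PC": 10, "MBR": 0x60}         # MBR holds IADD's opcode
memory = {11: 0x57}                     # hypothetical next opcode byte
```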

9 The Microprogram

10 The Microprogram

11 The Microprogram

12 The Microprogram

13 The Microprogram

14 Increasing the Speed
In general, we trade off speed against cost: a faster implementation requires more hardware complexity. There are three basic approaches to increasing the speed of execution:
Reduce the number of clock cycles needed to execute an instruction.
Simplify the organization so that the clock cycle can be shorter.
Overlap the execution of instructions.

15 Increasing the Speed

16 Increasing the Speed

17 Increasing the Speed
As another speed increase, notice that for every instruction the following operations may occur:
The PC is passed through the ALU and incremented.
The PC is used to fetch the next byte in the instruction stream.
Operands are read from memory.
Operands are written to memory.
The ALU does a computation and the results are stored back.
We can free up the ALU from instruction fetching by introducing an IFU (Instruction Fetch Unit).

18 Instruction Fetch Unit

19 Instruction Fetch Unit

20 Pipelining We can further speed up processing by overlapping the execution of instructions. To do this, we break the data path into three parts by inserting latches (registers) into it. Each of the resulting shorter data paths can run with a shorter clock cycle than the original.

21 Pipelining

22 Pipelining

23 Pipelining

24 Pipelining

25 Cache Memory The recent improvements in CPU speed have led to an even wider gap between CPU and memory speeds. One way to manage this problem is through the use of high-speed cache memory. One technique that is quite effective is to use separate caches for data and instructions. This is called a split cache.

26 Cache Memory A split cache allows memory operations to be initiated independently in each cache, doubling the bandwidth of the memory system. Each cache has independent access to the main memory. An additional cache, called level 2 cache, may reside between the instruction and data caches and main memory. There may be three or more levels of cache as more sophisticated memory systems are required.

27 Cache Memory The following slide shows a typical arrangement.
The CPU chip itself contains a small instruction cache and a small data cache, typically 16 KB to 64 KB. The level 2 cache is not on the CPU chip, but may be included in the CPU package, connected by a high-speed data path. This cache is generally unified (containing both data and instructions) and is between 512 KB and 1 MB.

28 Cache Memory The third level cache is on the processor board and contains a few megabytes of SRAM, which is faster than the main DRAM memory. Caches are generally inclusive, with the full contents of the level 1 cache being in the level 2 cache and the full contents of the level 2 cache being in the level 3 cache.

29 Cache Memory

30 Cache Memory Cache memories depend on two kinds of address locality to achieve their goal.
Spatial locality is the observation that memory locations with addresses numerically close to a recently accessed location are likely to be accessed soon.
Temporal locality occurs when recently accessed memory locations are accessed again.
Main memory is divided into fixed-size blocks called cache lines, typically 4 to 64 bytes. Lines are numbered consecutively starting at 0, so with a 32-byte line size, line 0 is bytes 0 to 31, line 1 is bytes 32 to 63, and so on.
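The line numbering above reduces to integer division (a sketch with the slide's 32-byte lines; the function name is ours):

```python
# With a 32-byte line, dividing the byte address by the line size gives
# the memory line number: line 0 is bytes 0..31, line 1 is bytes 32..63, etc.
LINE_SIZE = 32

def memory_line(address):
    return address // LINE_SIZE
```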

31 Cache Memory At any instant, some lines are in the cache.
When memory is referenced, the cache controller circuit checks to see if the word referenced is currently in the cache. If it is, it is used. If not, some line entry is removed from the cache and the line needed is fetched from memory or some lower level cache to replace it. Many variations exist, but the central idea is always to keep the most heavily-used lines in the cache.

32 Cache Memory Cache memory can be organized in several ways:
Direct-mapped caches fix the cache entry in which a particular portion of main memory can be stored. This is the fastest kind of cache to search, but the most inflexible.
Set-associative caches allow a portion of main memory to be stored in one of several cache entries. This allows fast retrieval and flexible storage.
A fully-associative cache allows a memory line to be stored in any entry of the cache. Searching every entry makes retrieval slow, so this organization is not very practical.

33 Direct-Mapped Caches Direct-mapped caches contain a number (say 2048) of entries. Each entry consists of:
The Valid bit, which indicates whether there is any valid data in this entry. Initially, all entries are marked invalid.
The Tag field, a unique value (in the example, 16 bits) identifying the line of memory from which the data came.
The Data field, which contains a copy of the data in memory. This field holds one cache line of 32 bytes.

34 Direct-Mapped Caches A memory word can be stored in exactly one place within a direct-mapped cache. Given a memory address, there is only one place to look for it in the cache. For storing and retrieving data, the address is broken into four fields:
The TAG field corresponds to the Tag bits stored in a cache entry.
The LINE field indicates which entry holds the corresponding data, if they are present.

35 Direct-Mapped Caches The WORD field tells which word within a line is referenced. The BYTE field is usually not used, but if only a single byte is requested, it tells which byte within the word is needed. For a cache supplying only 32-bit words, this field will always be 0. When the CPU produces an address, the hardware extracts the 11 LINE bits from the address and uses them to index into the cache to find one of the 2048 entries. If the entry is valid, the tags are compared; if they agree, a cache hit has occurred.
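The field extraction can be sketched with the slide's parameters (the function is illustrative; the widths follow from 32-byte lines, 4-byte words, 2048 entries, and a 32-bit address):

```python
# Splitting a 32-bit address: 2 BYTE bits, 3 WORD bits (eight 4-byte words
# per 32-byte line), 11 LINE bits (2048 entries), and 16 TAG bits.
def split_address(addr):
    byte = addr & 0x3            # which byte within the 4-byte word
    word = (addr >> 2) & 0x7     # which word within the 32-byte line
    line = (addr >> 5) & 0x7FF   # which of the 2048 cache entries
    tag  = addr >> 16            # identifies the memory line uniquely
    return tag, line, word, byte
```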

36 Direct-Mapped Caches

37 Direct-Mapped Caches If the cache entry is invalid or the tags do not match, a cache miss has occurred. In this case, the 32-byte cache line is fetched from memory and stored in the cache entry, replacing what was there. If the existing cache entry has been modified since being loaded, it must be written back to main memory. The retrieval process is made faster by performing the retrieval from the cache in parallel with the comparison of the tags.
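The hit/miss logic can be sketched as follows (the cache is just a Python list of 2048 entries, fetch_line stands in for a main-memory access, and write-back of modified lines is omitted for brevity):

```python
# Direct-mapped lookup: index by the LINE bits, compare tags on a valid
# entry; on a miss, fetch the 32-byte line and replace what was there.
cache = [{"valid": False, "tag": 0, "data": None} for _ in range(2048)]

def lookup(addr, fetch_line):
    line = (addr >> 5) & 0x7FF
    tag = addr >> 16
    entry = cache[line]
    if entry["valid"] and entry["tag"] == tag:
        return entry["data"], True                      # cache hit
    # cache miss: fetch the whole line (addr rounded down to a line boundary)
    entry.update(valid=True, tag=tag, data=fetch_line(addr & ~0x1F))
    return entry["data"], False
```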

38 Direct-Mapped Caches Up to 64 KB of contiguous data can be stored in the cache. However, two lines whose addresses differ by a multiple of 64 KB (65,536 bytes) map to the same entry and so cannot be stored in the cache at the same time. Direct-mapped caches are the most common kind of cache, and they perform quite well, since collisions of the kind described above don't happen often. A compiler can take the cache into account when placing data and instructions in memory.

39 Set-Associative Caches
A solution to the problem of lines competing for the same cache entry is to allow two or more lines in each cache entry. A cache with n possible slots for each address is called an n-way set-associative cache. A set-associative cache is inherently more complicated than a direct-mapped cache because up to n tags must be checked to see whether the needed line is present. Two-way and four-way caches perform quite well.

40 Set-Associative Caches

41 Set-Associative Caches
The use of a set-associative cache raises the question of which line should be discarded when a new line is brought into the entry. A good choice is the LRU (Least Recently Used) algorithm, which replaces the line that was accessed longest ago. If we carry the set-associative idea to its extreme, we get a 2048-way cache, that is, a fully-associative cache. These improve little over the performance of 4-way caches and so are not much used.
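LRU replacement for one set can be sketched like this (the list model is ours; real hardware approximates LRU with a few bits per set rather than an ordered list):

```python
# One set of a 4-way set-associative cache, kept as a list with the most
# recently used tag first. A miss into a full set evicts the last element,
# i.e., the least recently used line.
def access(lru_set, tag, ways=4):
    if tag in lru_set:
        lru_set.remove(tag)
        lru_set.insert(0, tag)          # move to front: most recently used
        return True                     # hit
    if len(lru_set) == ways:
        lru_set.pop()                   # evict the least recently used line
    lru_set.insert(0, tag)
    return False                        # miss
```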

42 Branch Prediction Modern computers are highly pipelined, often with 10 or more stages. Pipelining works best on linear code, so that consecutive words can be read from memory and sent off to be executed. Unfortunately, real code is full of branches. See, for example, the code on the next slide: two of the five instructions are branches, and the longest linear code sequence is two instructions.

43 Branch Prediction

44 Branch Prediction Even unconditional branches cause problems, since we have to decode the instruction (in the pipeline) before we can see that it is a branch. A number of pipelined machines (such as the UltraSPARC II) have the property that the instruction following an unconditional branch is executed, even though logically it should not be. The position after a branch is called a delay slot. The Pentium II does not have this property; hiding the delay slot costs extra hardware complexity. On machines with delay slots, the compiler tries to put a useful instruction in the slot, falling back to a NOP.

45 Branch Prediction Conditional branches are even worse: not only do they have delay slots, but the fetch unit does not know where to read from until much later in the pipeline. Early pipelined machines just stalled until it was known whether the branch would be taken or not. What most machines do now when they hit a conditional branch is predict whether it will be taken. One simple technique: assume all backward branches will be taken and all forward ones will not.
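The static heuristic is a one-line comparison (sketch; function name and addresses are illustrative):

```python
# Backward branches (typically the bottoms of loops) are predicted taken;
# forward branches are predicted not taken.
def predict_taken(branch_addr, target_addr):
    return target_addr < branch_addr    # backward branch => predict taken
```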

46 Branch Prediction If we guess incorrectly, we have to undo what the (incorrect) instructions have done. One approach is to let the instructions continue executing until they try to change the machine's state: instead of overwriting a register, the value is put into a (secret) scratch register and copied to the real register only after the prediction is known to be correct. Alternatively, we can record the old values of overwritten registers in (secret) scratch registers and restore them if the prediction turns out to be wrong.

47 Dynamic Branch Prediction
We can use a history table to record whether each branch was taken or not, and consult this table when the branch occurs again. The prediction is simply that the branch will behave the same way it did the time before. This works well, except at the end of loops. To handle this case, we might change the prediction only after it has been wrong twice in a row. The history table can be organized the same way a cache is organized.
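The "wrong twice in a row before changing" rule is commonly expressed as a 2-bit saturating counter, one per history-table entry (the class below is an illustrative sketch, not any particular machine's hardware):

```python
# States 0-1 predict not taken, states 2-3 predict taken. A single
# misprediction moves the counter one step; only two in a row flip
# the prediction, which handles the last iteration of a loop well.
class TwoBitPredictor:
    def __init__(self, state=3):        # start strongly predicting "taken"
        self.state = state
    def predict(self):
        return self.state >= 2          # True means "predict taken"
    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```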

48 Dynamic Branch Prediction

49 Dynamic Branch Prediction

