How does the CPU work? The CPU's program counter (PC) register holds the address of the next instruction. Control circuits "fetch" the contents of the location at that address; the instruction is then "decoded" and executed. During execution of each instruction, the PC register is incremented by 4 (the size, in bytes, of one instruction on a typical 32-bit machine) … But *how* exactly? CSE 3430; Part 4

A simple (accumulator) machine: 8-bit words, 5-bit address, 3-bit op-code. Instructions and op-codes: ADD 000, SUB 001, MPY 010, DIV 011, LOAD 100, STORE 101. In machine language, the address is in bits 0–4, the op-code in bits 5–7. Example code for C = A*B + C*D (A in the word at address 20, B in 21, C in 22, D in 23; the word at 30 (E) is used for temporary storage):

100 10100   LOAD A
010 10101   MPY B
101 11110   STORE E
100 10110   LOAD C
010 10111   MPY D
000 11110   ADD E
101 10110   STORE C
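The accumulator machine above can be simulated directly. The following sketch (the function and variable names are mine, not part of the slides) decodes each 8-bit word into its op-code and address fields and executes the example program; the sample operand values are arbitrary.

```python
# Op-code table from the slide; op-code is in bits 5-7, address in bits 0-4.
OPS = {0b000: "ADD", 0b001: "SUB", 0b010: "MPY", 0b011: "DIV",
       0b100: "LOAD", 0b101: "STORE"}

def run(program, memory):
    acc = 0
    for word in program:               # each word is an 8-bit integer
        op = OPS[(word >> 5) & 0b111]  # op-code: bits 5-7
        addr = word & 0b11111          # address: bits 0-4
        if op == "LOAD":    acc = memory[addr]
        elif op == "STORE": memory[addr] = acc
        elif op == "ADD":   acc += memory[addr]
        elif op == "SUB":   acc -= memory[addr]
        elif op == "MPY":   acc *= memory[addr]
        elif op == "DIV":   acc //= memory[addr]
    return memory

mem = {20: 3, 21: 4, 22: 5, 23: 6, 30: 0}   # A=3, B=4, C=5, D=6, E is temp
prog = [0b100_10100,  # LOAD A
        0b010_10101,  # MPY B
        0b101_11110,  # STORE E
        0b100_10110,  # LOAD C
        0b010_10111,  # MPY D
        0b000_11110,  # ADD E
        0b101_10110]  # STORE C
run(prog, mem)
print(mem[22])  # C now holds A*B + C*D = 3*4 + 5*6 = 42
```

Note that the accumulator is the implicit second operand of every arithmetic instruction, which is why the program must park the first product in E before computing the second.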

Structure of simple CPU

[Diagram: the CPU datapath. An internal bus connects the PC (loaded through a 2 → 1 MUX, with an INC circuit), the IR (feeding a decoder that splits it into OP and Addr fields, which drive the Timing and Control unit), the ACC and ALU (with their own 2 → 1 MUX), and the MAR and MDR.]

Structure of simple CPU (contd): the bus shown is internal to the CPU. There is a separate bus from the memory to MAR and MDR.

MAR is the memory address register; MDR (also called MBR) is the memory data (buffer) register. To read a word from memory, the CPU must put the address of the word in MAR and wait a certain number of clock cycles; at the end of that, the value at that memory address will appear in MDR. To write a word to memory, the CPU must put the address of the word in MAR and the value to be written in MDR, set the "write enable" bit, and wait a certain number of clock cycles.
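The MAR/MDR handshake can be modeled in a few lines. This is a minimal sketch under my own assumptions (class names, a fixed latency of 3 cycles); the only point it illustrates is that the CPU talks to memory solely through these two registers and pays a fixed wait per access.

```python
class Memory:
    def __init__(self, words, latency=3):
        self.words = words        # address -> value
        self.latency = latency    # clock cycles per access (assumed)

class MemoryInterface:
    """Models the MAR/MDR interface described above."""
    def __init__(self, memory):
        self.memory = memory
        self.mar = 0              # memory address register
        self.mdr = 0              # memory data register
        self.cycles = 0           # total cycles spent waiting on memory

    def read(self):
        # CPU has placed the address in MAR; after the wait,
        # the value appears in MDR.
        self.cycles += self.memory.latency
        self.mdr = self.memory.words[self.mar]
        return self.mdr

    def write(self):
        # CPU has placed the address in MAR and the value in MDR
        # and set write-enable; after the wait, memory is updated.
        self.cycles += self.memory.latency
        self.memory.words[self.mar] = self.mdr

bus = MemoryInterface(Memory({20: 7}))
bus.mar = 20
print(bus.read())   # 7, after 3 cycles of waiting
```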

PC is the program counter. INC is a simple circuit whose output is one greater than its input. The MUX is a multiplexer that outputs one of its two inputs, depending on the value of a control signal (not shown); this allows for both normal sequential control flow and branches.

ALU is the arithmetic/logic unit and does all the arithmetic. ACC is the accumulator. It can be loaded with a value from the ALU or from the bus; the value in it can be used as an input to the ALU or copied into MDR (why? when?).

IR (the instruction register) contains the instruction currently being executed. The decoder splits it into the address (Addr) and the operation (OP) to be performed. Timing and control generates the correct control signals and, in effect, runs the whole show.

"Timing and control" generates a set of "control signals" that essentially control what happens. Key inputs to timing and control (TAC): the clock and condition signals (from PS). Key idea: at each clock cycle, the current state is updated to the appropriate next state and a new set of control signals is generated.

[Diagram: condition signals and the clock feed next-state logic; its output is latched into a current-state register, which drives the control-signal outputs.]

Control signals:

 0  Acc → bus        8  ALU → Acc
 1  load Acc         9  INC → PC
 2  PC → bus        10  ALU operation
 3  load PC         11  ALU operation
 4  load IR         12  Addr → bus
 5  load MAR        13  CS
 6  MDR → bus       14  R/W
 7  load MDR

Finally: How the CPU works. States 0, 1, 2: fetch; the rest: decode and execute.

[State diagram: the fetch states assert PC → bus / load MAR, then CS, R/W (memory read) together with INC → PC / load PC, then MDR → bus / load IR. The remaining states assert Addr → bus and CS, then branch on the op-code: if OP=store, ACC → bus and load MDR (memory write); if OP=load, load ACC from the bus; otherwise perform the ALU op and then ALU → ACC.]
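The fetch portion of the control state machine can be sketched as code. This is my reconstruction from the signal names on the slide, not a definitive transcription of the state diagram: each step below corresponds to one control state and records the signals that state asserts.

```python
def fetch_cycle(pc, memory):
    """One instruction fetch, as a sequence of control states."""
    asserted = []
    # Fetch state 1: PC -> bus, load MAR
    mar = pc
    asserted.append({"PC -> bus", "load MAR"})
    # Fetch state 2: memory read (CS, R/W); meanwhile the incremented
    # PC is written back (INC -> PC, load PC)
    mdr = memory[mar]
    pc = pc + 1
    asserted.append({"CS", "R/W", "INC -> PC", "load PC"})
    # Fetch state 3: MDR -> bus, load IR
    ir = mdr
    asserted.append({"MDR -> bus", "load IR"})
    return pc, ir, asserted

pc, ir, signals = fetch_cycle(0, {0: 0b100_10100})   # LOAD A, from earlier
print(bin(ir), pc)
```

After these three states, the decoder splits IR into OP and Addr and timing-and-control branches into the decode/execute states.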

What if we want to handle interrupts? Answer: the interrupt line would feed into the next-state logic of timing and control.

Improving Performance

Problem: speed mismatch between the CPU and memory. Memory *can* be fast, but then it becomes expensive.

Solution: a memory hierarchy (cheaper and slower as you go down the list):
- CPU registers
- Cache (Level 1, Level 2, …)
- Main memory (may be more than one kind)
- Disk/SSD, …
- Flash cards, tapes, etc.

Memory hierarchy (contd)

Key requirement: the data that the CPU needs next must be as high up in the hierarchy as possible.

Important concept: locality of reference.
- Temporal locality: a recently executed instruction is likely to be executed again soon.
- Spatial locality: instructions near a recently executed instruction are likely to be executed soon.

Cache and Main Memory

[Diagram: CPU ↔ Cache ↔ Main Memory]

When a read is received and the word is not in the cache, a block of words containing that word is transferred to the cache (one word at a time). Locality of reference means future requests can probably be met by the cache. The CPU doesn't worry about these details; the circuitry in the cache handles them.

Cache structure & operation

A cache is organized as a collection of blocks. Example: a cache of 128 blocks, 16 words/block; memory of 64K words with 16-bit addresses, i.e., 4K blocks.

Direct-mapping approach: block j of memory → cache block j mod 128. So blocks 0, 128, 256, … of main memory all map to cache block 0; etc.

A memory address splits into 5 tag bits + 7 block bits + 4 word bits:
- Block bits → the relevant cache block
- Word bits → which word in the block
- Tag bits → which of memory blocks 0, 128, … is currently there?
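The address split above is just bit slicing. A minimal sketch (the function name is mine), using the slide's parameters of 4 word bits, 7 block bits, and 5 tag bits:

```python
def split_address(addr):
    """Split a 16-bit address for a 128-block, 16-words/block cache."""
    word  = addr & 0xF           # low 4 bits: word within the block
    block = (addr >> 4) & 0x7F   # next 7 bits: cache block index
    tag   = addr >> 11           # top 5 bits: which memory block is it?
    return tag, block, word

# Memory blocks 0 and 128 (addresses 0 and 128*16) map to the same
# cache block (index 0) but carry different tags:
print(split_address(0))          # (0, 0, 0)
print(split_address(128 * 16))   # (1, 0, 0)
```

The shared cache-block index with differing tags is exactly why the tag must be stored alongside the cached block: it is the only way to tell the competing memory blocks apart.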

Cache structure & op (contd)

When a block (16 words) of memory is stored in the corresponding cache block, the tag bits of that memory block are stored along with it. When the CPU asks for a word of memory, the cache compares the leftmost 5 bits of the address with the tag bits stored with the corresponding cache block ("corresponding"?). If they match, there is a cache "hit", and we can use the copy in the cache.

Cache structure & op (contd)

But what if it is a write operation? We need to keep the copy in main memory consistent as well:
- Write-through protocol: update both the value in the cache and the value in memory.
- Write-back protocol: update only the cache location, but set the cache block's dirty bit to 1.
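The two write policies differ only in when main memory is touched. A toy sketch (function names and the dict-based model are my own simplification; real hardware tracks one dirty bit per block, not per word):

```python
def write_through(cache, memory, addr, value):
    cache[addr] = value
    memory[addr] = value      # memory is updated on every write

def write_back(cache, dirty, addr, value):
    cache[addr] = value
    dirty[addr] = True        # memory is updated later, on eviction

cache, memory, dirty = {}, {10: 0}, {}
write_through(cache, memory, 10, 5)
write_back(cache, dirty, 10, 7)
print(memory[10], cache[10], dirty)   # memory still holds the old 5
```

Write-through keeps memory always consistent at the cost of a memory access per write; write-back batches the cost but forces the eviction check described on the next slide.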

Cache structure & op (contd)

What if the word is not in the cache? We need to read the entire block of memory that contains that word (i.e., based on the first 12 bits of the address) into the right cache block. But first: check whether the dirty bit of that cache block is 1 and, if so, write the block back to memory before doing the above. This can lead to poor performance, depending on the degree of spatial/temporal locality of reference.

Cache structure & op (contd)

Associative-mapping approach: a main-memory block may be placed in any cache block. Each cache block has a *12-bit* tag that identifies which memory block is currently mapped to it. When an address is received from the CPU, the cache compares the first 12 bits with the tag of each cache block to see if there is a match. That can be done quite fast (in parallel).

Cache structure & op (contd)

For anything other than direct mapping, we need a suitable replacement algorithm. Widely used: replace the least recently used (LRU) block. Surprising: random replacement does very well. Not so surprising: even small caches are useful.
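LRU replacement for a fully associative cache can be sketched with an ordered dictionary as the recency queue (a common software idiom; the hardware does this with age bits, and the class name here is mine):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, n_blocks):
        self.n_blocks = n_blocks
        self.blocks = OrderedDict()   # block number -> present

    def access(self, block_no):
        """Return True on a hit, False on a miss (with LRU eviction)."""
        hit = block_no in self.blocks
        if hit:
            self.blocks.move_to_end(block_no)    # now most recently used
        else:
            if len(self.blocks) >= self.n_blocks:
                self.blocks.popitem(last=False)  # evict least recently used
            self.blocks[block_no] = True
        return hit

c = LRUCache(2)
c.access(1); c.access(2)       # two compulsory misses
c.access(1)                    # hit; block 2 is now the LRU block
c.access(3)                    # miss; evicts block 2, not block 1
print(c.access(2))             # False: 2 was the one evicted
```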

Cache structure & op (contd)

Good measures of effectiveness: the hit rate and miss rate. These can depend on the program being executed; compilers try to produce code that ensures high hit rates. The cache structure can also be tweaked: e.g., have separate "code cache" and "data cache".
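Hit and miss rates are easy to measure on a model. This sketch (class name mine) implements the direct-mapped cache from the earlier example and counts hits and misses over a sequential address stream, showing how spatial locality pays off:

```python
class DirectMappedCache:
    def __init__(self, n_blocks=128, words_per_block=16):
        self.n_blocks = n_blocks
        self.words_per_block = words_per_block
        self.tags = [None] * n_blocks   # tag stored with each cache block
        self.hits = self.misses = 0

    def access(self, addr):
        block_no = addr // self.words_per_block
        index = block_no % self.n_blocks       # block j -> cache j mod 128
        tag = block_no // self.n_blocks
        if self.tags[index] == tag:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[index] = tag             # bring the block in

cache = DirectMappedCache()
for addr in range(64):          # sequential accesses: spatial locality
    cache.access(addr)
print(cache.hits, cache.misses) # 60 hits, 4 misses (one per 16-word block)
```

Each 16-word block costs one compulsory miss and then serves 15 hits, for a 93.75% hit rate on this access pattern; a pattern with no locality would fare far worse.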

Improving performance: Pipelining

Key idea: simultaneously perform different stages of consecutive instructions: F(etch), D(ecode), E(xec), W(rite).

Cycle:  1   2   3   4   5   6   7
I1:     F1  D1  E1  W1
I2:         F2  D2  E2  W2
I3:             F3  D3  E3  W3
I4:                 F4  D4  E4  W4

Need buffers between stages.
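The timing diagram above follows a simple rule: with no stalls, instruction i enters stage s in cycle i + s (1-based). A small sketch (names mine) that generates the schedule and confirms the total cycle count:

```python
STAGES = ["F", "D", "E", "W"]

def pipeline_schedule(n_instructions):
    """Map (instruction, cycle) -> stage label for an ideal 4-stage pipe."""
    schedule = {}
    for i in range(1, n_instructions + 1):
        for s, name in enumerate(STAGES):
            schedule[(i, i + s)] = f"{name}{i}"
    return schedule

sched = pipeline_schedule(4)
print(sched[(1, 1)], sched[(2, 3)], sched[(4, 7)])   # F1 D2 W4
# 4 instructions finish in 4 + (4 - 1) = 7 cycles, not 4 * 4 = 16:
print(max(cycle for (_, cycle) in sched))            # 7
```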

Pipelining (contd)

Need buffers between stages: B1 between fetch and decode, B2 between decode (which also fetches operands) and execute, B3 between execute and write.

During clock cycle 4:
- Buffer B1 holds I3, which was fetched in cycle 3 and is being decoded.
- B2 holds both the source operands for I2 and the specification of the operation to be performed, produced by the decoder in cycle 3; B2 also holds the information that will be needed for the write step (in the next cycle) of I2.
- B3 holds the results produced by the execute unit and the destination information for I1.

Potential problems in pipelining

Mismatched stages: different stages require different numbers of cycles to finish, e.g., instruction fetch. A cache can help address this. But what if the previous instruction is a branch? That is an instruction hazard, especially problematic for conditional branches. Various solutions, in both hardware and software (in compilers), have been tried.

Potential problems in pipelining (contd)

Data hazards: the data needed to execute an instruction is not yet available. Perhaps the data has to be computed by the previous instruction; this can happen even in the case of register operands (how?). Again, various solutions have been proposed for dealing with data hazards.

Important concepts: the data cache vs. the instruction cache; also multiple levels of cache (part of the memory hierarchy).

Improving perf.: multiple processors

SIMD (single-instruction, multiple-data). One of the earliest forms: vector/array processors, in which a control processor broadcasts instructions to an array of processing elements. Very useful for matrix computations; likely to be of value in data-analytics applications. GPUs use a similar architecture.

Improving perf.: multiple processors (contd)

MIMD (multiple-instruction, multiple-data): different CPUs executing different instructions on different sets of data. Tends to be complex, with questions such as how to organize memory:
- Common memory accessible to all processors? (slow)
- A copy of a portion of memory in the cache of each processor? (fast, but what about cache coherence?)

The OS plays an important role in managing such systems. (Ignoring remaining slides.)

Interrupts?

[Diagram: devices 0–3 attached to interrupt controllers that signal the CPU, with interrupt-in-service and interrupt-mask registers.]
