Presentation on theme: "Basic Computer Architecture CSCE 496/896: Embedded Systems Witawas Srisa-an."— Presentation transcript:
Basic Computer Architecture CSCE 496/896: Embedded Systems Witawas Srisa-an
Review of Computer Architecture zCredit: Most of the slides are made by Prof. Wayne Wolf who is the author of the textbook. zI made some modifications to the note for clarity. yAssume some background information from CSCE 430 or equivalent
von Neumann architecture zMemory holds data and instructions. zCentral processing unit (CPU) fetches instructions from memory. ySeparate CPU and memory distinguishes programmable computer. zCPU registers help out: program counter (PC), instruction register (IR), general- purpose registers, etc.
von Neumann Architecture Memory Unit CPU Control + ALU Output Unit Input Unit
CPU + memory memory CPU PC address data IRADD r5,r1,r3 200 ADD r5,r1,r3
What is a potential Problem with von Neumann Architecture?
Harvard architecture CPU PC data memory program memory address data address data
von Neumann vs. Harvard zHarvard can’t use self-modifying code. zHarvard allows two simultaneous memory fetches. zMost DSPs (e.g Blackfin from ADI) use Harvard architecture for streaming data: ygreater memory bandwidth. ydifferent memory bit depths between instruction and data. ymore predictable bandwidth.
Today’s Processors Harvard or von Neumann?
RISC vs. CISC zComplex instruction set computer (CISC): ymany addressing modes; ymany operations. zReduced instruction set computer (RISC): yload/store; ypipelinable instructions.
Instruction set characteristics zFixed vs. variable length. zAddressing modes. zNumber of operands. zTypes of operands.
Tensilica Xtensa zRISC based variable length yBut not CISC
Programming model zProgramming model: registers visible to the programmer. zSome registers are not visible (IR).
Multiple implementations zSuccessful architectures have several implementations: yvarying clock speeds; ydifferent bus widths; ydifferent cache sizes, associativities, configurations; ylocal memory, etc.
Assembly language zOne-to-one with instructions (more or less). zBasic features: yOne instruction per line. yLabels provide names for addresses (usually in first column). yInstructions often start in later columns. yColumns run to end of line.
ARM assembly language example label1ADR r4,c LDR r0,[r4] ; a comment ADR r4,d LDR r1,[r4] SUB r0,r0,r1 ; comment destination
Pseudo-ops zSome assembler directives don’t correspond directly to instructions: yDefine current address. yReserve storage. yConstants.
Pipelining zExecute several instructions simultaneously but at different stages. zSimple three-stage pipe: fetchdecodeexecutememory
Pipeline complications zMay not always be able to predict the next instruction: yConditional branch. zCauses bubble in the pipeline: fetchdecode Execute JNZ fetchdecodeexecute fetchdecodeexecute
Superscalar zRISC pipeline executes one instruction per clock cycle (usually). zSuperscalar machines execute multiple instructions per clock cycle. yFaster execution. yMore variability in execution times. yMore expensive CPU.
Simple superscalar zExecute floating point and integer instruction at the same time. yUse different registers. yFloating point operations use their own hardware unit. zMust wait for completion when floating point, integer units communicate.
Costs zGood news---can find parallelism at run time. yBad news---causes variations in execution time. zRequires a lot of hardware. yn 2 instruction unit hardware for n-instruction parallelism.
Finding parallelism zIndependent operations can be performed in parallel: ADD r0, r0, r1 ADD r3, r2, r3 ADD r6, r4, r0 ++ + r0 r1r2r3 r0 r4 r6 r3
Pipeline hazards Two operations that have data dependency cannot be executed in parallel: x = a + b; a = d + e; y = a - f; - + + x a b d e a y f
Order of execution zIn-order: yMachine stops issuing instructions when the next instruction can’t be dispatched. zOut-of-order: yMachine will change order of instructions to keep dispatching. ySubstantially faster but also more complex.
VLIW architectures zVery long instruction word (VLIW) processing provides significant parallelism. zRely on compilers to identify parallelism.
What is VLIW? zParallel function units with shared register file: register file function unit function unit function unit function unit... instruction decode and memory
VLIW cluster zOrganized into clusters to accommodate available register bandwidth: cluster...
VLIW and compilers zVLIW requires considerably more sophisticated compiler technology than traditional architectures---must be able to extract parallelism to keep the instructions full. zMany VLIWs have good compiler support.
Scheduling ab c d ef g abe fc dg nop expressionsinstructions
EPIC zEPIC = Explicitly parallel instruction computing. zUsed in Intel/HP Merced (IA-64) machine. zIncorporates several features to allow machine to find, exploit increased parallelism.
IA-64 instruction format zInstructions are bundled with tag to indicate which instructions can be executed in parallel: taginstruction 1instruction 2instruction 3 128 bits
Memory system zCPU fetches data, instructions from a memory hierarchy: Main memory L2 cache L1 cache CPU
Memory hierarchy complications zProgram behavior is much more state- dependent. yDepends on how earlier execution left the cache. zExecution time is less predictable. yMemory access times can vary by 100X.