Processor Architectures and Program Mapping (5kk10), by H. Corporaal, J. van Meerbergen, and B. Mesman


1 Processor Architectures and Program Mapping (5kk10). H. Corporaal, J. van Meerbergen, and B. Mesman

2 [Figure: flexibility versus efficiency for four design styles: programmable CPU, programmable DSP, application specific instruction set processor (ASIP), application specific processor.]

3 [Figure: flexibility (low, medium, high) versus efficiency (high, medium, low) for ASIC, GP processor, FPGA, DSP, and ASIP.]

4 Programmable CPU cores
introduction
architecture of the MIPS core, discussed as an example
pipelining
application examples
software issues
comparison between different CPU cores
towards application specific architectures
discussion

5 Introduction
rationale: as high a multiplex factor R as possible
consequence: often a manual, handcrafted design optimised for clock rate
problem: fast changes in the IC process technology
examples:
–embedded: MIPS (the first one, licensing the instruction set architecture); ARM (Advanced RISC Machines; telecom, low power, small code size, the most popular one, also licensing the micro-architecture as hard or soft IP); Sparc
–derivatives from general purpose CPUs: Intel, NEC, Hitachi, National, PowerPC

6 Introduction: instruction set architectures
implicit operands: stack machines (e.g. ST20), accumulator machines
explicit operands: general purpose registers, either register-memory or register-register (= load-store)

7 Introduction
Example: C = A + B
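
As a rough illustration of how three of these instruction set classes could evaluate C = A + B, the sketch below runs the same computation on three toy machine models. The instruction names (push, pop, load, add, store) and the small interpreters are assumptions made for illustration; they are not taken from the slides or from a real instruction set.

```python
def run_stack(mem):
    """Stack machine: add takes both operands implicitly from the stack."""
    program = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
    stack = []
    for instr in program:
        if instr[0] == "push":
            stack.append(mem[instr[1]])
        elif instr[0] == "add":
            stack.append(stack.pop() + stack.pop())
        else:  # pop: store the top of the stack to memory
            mem[instr[1]] = stack.pop()
    return len(program)


def run_accumulator(mem):
    """Accumulator machine: one operand is implicitly the accumulator."""
    program = [("load", "A"), ("add", "B"), ("store", "C")]
    acc = 0
    for op, name in program:
        if op == "load":
            acc = mem[name]
        elif op == "add":
            acc += mem[name]
        else:  # store the accumulator to memory
            mem[name] = acc
    return len(program)


def run_load_store(mem):
    """Register-register machine: explicit operands, memory only via load/store."""
    program = [("load", "R1", "A"), ("load", "R2", "B"),
               ("add", "R3", "R1", "R2"), ("store", "R3", "C")]
    regs = {}
    for instr in program:
        if instr[0] == "load":
            regs[instr[1]] = mem[instr[2]]
        elif instr[0] == "add":
            regs[instr[1]] = regs[instr[2]] + regs[instr[3]]
        else:  # store a register to memory
            mem[instr[2]] = regs[instr[1]]
    return len(program)
```

Each function returns the number of instructions it executed, which gives a feel for the trade-off: fewer operands per instruction mean shorter instructions, while the load-store variant needs the most instructions but keeps every operand explicit.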

8 Architecture of the MIPS core [Hennessy & Patterson]
[Figure: datapath with PC, instruction memory (instruction address, instruction fields Rd, Rs, Rt, Imm), a register file of 32-bit registers (ports Rw, Ra, Rb), and data memory (data address, data in, data out); PC, register file, and data memory are clocked by Clk.]

9 MIPS instruction formats (32 bits) [Hennessy & Patterson]
R-type: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
I-type: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
J-type: op (6 bits) | target address (26 bits)
op: operation of the instruction
rs, rt, rd: source and destination registers
shamt: shift amount
funct: operation of the instruction, part 2
imm: for program constants
addr: target address of a jump
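
The field layout above can be checked with a few shifts and masks. The following is a minimal sketch; the function names `decode` and `encode_r` are made up for this example, and the encoding of `add` (op = 0, funct = 0x20) is the standard MIPS one rather than something stated on the slide.

```python
def decode(word):
    """Split a 32-bit instruction word into the fields of all three formats."""
    return {
        "op": (word >> 26) & 0x3F,      # bits 31..26, common to R, I, and J
        "rs": (word >> 21) & 0x1F,      # bits 25..21
        "rt": (word >> 16) & 0x1F,      # bits 20..16
        "rd": (word >> 11) & 0x1F,      # bits 15..11 (R-type only)
        "shamt": (word >> 6) & 0x1F,    # bits 10..6  (R-type only)
        "funct": word & 0x3F,           # bits 5..0   (R-type only)
        "imm": word & 0xFFFF,           # bits 15..0  (I-type only)
        "target": word & 0x3FFFFFF,     # bits 25..0  (J-type only)
    }


def encode_r(op, rs, rt, rd, shamt, funct):
    """Assemble an R-type word from its six fields."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct
```

For example, `encode_r(0, 1, 2, 3, 0, 0x20)` builds the word for `add $3, $1, $2`, and `decode` recovers the same fields from it.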

10 Example 1: R-type: add instruction [Hennessy & Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
add rd, rs, rt
mem[PC]
R[rd] = R[rs] + R[rt]
PC = PC + 4
[Figure: register file (Rw = Rd, Ra = Rs, Rb = Rt, RegWr, Clk) drives BusA and BusB (32 bits each) into the ALU, controlled by ALUctr; the result returns on BusW.]

11 Critical path R-type operation [Hennessy & Patterson]
[Figure: datapath with PC, instruction memory (instruction address, instruction fields Rd, Rs, Rt, Imm), register file (Rw, Ra, Rb), and data memory (data address, data in, data out).]

12 Critical path R-type operation
[Timing diagram: after the clock edge the PC changes from old to new value (clock-to-Q); Rs, Rt, Rd, op, and funct follow after the instruction memory access time; BusA and BusB follow after the register file access time; BusW follows after the ALU delay; set up time and skew precede the next clock edge, at which the result is written into the register file.]

13 Example 2: I-type: load word [Hennessy & Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
lw rs, rt, imm16
mem[PC]
addr = R[rs] + ext[imm16]
R[rt] = mem[addr]
PC = PC + 4
[Figure: datapath for load. RegDst selects Rd or Rt as the write register; Rb is don't care (Rt); the extender (ExtOp) widens Imm16 to 32 bits; ALUSrc selects BusB or the extended immediate as the second ALU input (ALUctr); the ALU result is the address (Adr) of the data memory (WrEn driven by MemWr, Data In from BusB); MemtoReg selects what is placed on BusW; RegWr enables the register write.]

14 Critical path load operation
[Timing diagram: after the clock edge the PC changes from old to new value (clock-to-Q); Rs, Rt, Rd, op, and funct follow after the instruction memory access time; BusA and BusB follow after the register file access time; the address follows after the ALU delay; BusW follows after the memory access time; set up time and skew precede the next clock edge.]

15 Example 3: I-type: branch [Hennessy & Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
beq rs, rt, imm16
mem[PC]
cond = R[rs] - R[rt]
if cond = 0
PC = PC + 4 + ext(imm16)*4
else
PC = PC + 4
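
The register-transfer descriptions of the three examples (add, lw, beq) can be tried out with a small interpreter. This is only a sketch: the tuple encoding of instructions, the `state` dictionary, and the function names are hypothetical, and the branch target assumes the standard MIPS rule that the sign-extended offset is counted in words relative to PC + 4.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's complement number."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def step(state, instr):
    """Execute one instruction on state = {'pc': int, 'regs': list, 'mem': dict}."""
    regs, mem = state["regs"], state["mem"]
    op = instr[0]
    if op == "add":                      # R[rd] = R[rs] + R[rt]
        _, rd, rs, rt = instr
        regs[rd] = (regs[rs] + regs[rt]) & 0xFFFFFFFF
        state["pc"] += 4
    elif op == "lw":                     # R[rt] = mem[R[rs] + ext(imm16)]
        _, rs, rt, imm16 = instr
        addr = (regs[rs] + sign_extend16(imm16)) & 0xFFFFFFFF
        regs[rt] = mem.get(addr, 0)      # unwritten memory reads as 0 (assumption)
        state["pc"] += 4
    elif op == "beq":                    # branch if R[rs] - R[rt] == 0
        _, rs, rt, imm16 = instr
        if (regs[rs] - regs[rt]) == 0:
            state["pc"] += 4 + sign_extend16(imm16) * 4
        else:
            state["pc"] += 4
    else:
        raise ValueError(f"unsupported instruction: {op}")
    return state
```

The operand order in the tuples follows the order used on the slides (`add rd, rs, rt`, `lw rs, rt, imm16`, `beq rs, rt, imm16`).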

16 Example 3: I-type: branch [Hennessy & Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
[Figure: datapath for branch. The register file (Rw, Ra = Rs, Rb = Rt, RegWr, RegDst, Clk) drives BusA and BusB into the ALU (ALUctr, ALUSrc, extender with ExtOp); the Zero output of the ALU, the Branch control signal, and the Imm field feed the next address logic, which updates the PC that addresses the instruction memory.]

17 Example 3: I-type: branch [Hennessy & Patterson]
[Figure: next address logic. A 30-bit PC, extended with “00”, addresses the instruction memory; an adder adds “1” to the PC, the sign-extended Imm field of the instruction is added for a branch, and a multiplexer (inputs 0 and 1) controlled by Branch and Zero selects the next PC.]

18 Example 3: I-type: branch [Hennessy & Patterson]
[Figure: variant of the next address logic. The multiplexer controlled by Branch and Zero selects either “0” or the sign-extended Imm field as the 30-bit adder input, and the constant “1” is applied at the carry input (c_in) of the adder; the 30-bit PC, extended with 00, addresses the instruction memory.]

19 Architecture of the MIPS core
problem: long critical path, defined by the slowest instruction (load)
solution: pipelining
–break the instruction into smaller steps
–all steps have about the same critical path
E.g. load, 5 stages:
cycle 1: Ifetch | cycle 2: RF read | cycle 3: ALU | cycle 4: dmem | cycle 5: RF write

20 Pipelining lw instructions [Hennessy & Patterson]
[Figure: three lw instructions, each with the stages Ifetch, RF read, ALU, dmem, RF write, started in successive cycles and shown over cycles 1 to 7.]
One instruction enters the pipeline every clock cycle.
One instruction leaves the pipeline every clock cycle.
=> CPI = 1 (Cycles per Instruction)
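
To see why CPI approaches 1, the ideal five-stage schedule can be generated and counted. This is a sketch under the assumption of a perfect pipeline with no stalls; the function names are invented for the example.

```python
STAGES = ["Ifetch", "RF read", "ALU", "dmem", "RF write"]


def schedule(n_instructions):
    """Return {cycle: [(instruction index, stage), ...]} for an ideal pipeline.

    Instruction i (counting from 0) is fetched in cycle i + 1.
    """
    table = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            table.setdefault(i + s + 1, []).append((i, stage))
    return table


def total_cycles(n_instructions):
    """Cycles until the last instruction has left the pipeline."""
    if n_instructions == 0:
        return 0
    return n_instructions + len(STAGES) - 1


def cpi(n_instructions):
    """Average cycles per instruction, including the pipeline fill time."""
    return total_cycles(n_instructions) / n_instructions
```

Three instructions take 7 cycles, as in the figure; for long instruction streams the four fill cycles are amortised and `cpi` tends to 1.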

21 Pipelining lw instructions
[Figure: successive lw instructions overlapping in the stages I, R, A, M, W; the current CPU cycle is marked, with the instruction and data accesses indicated.]

22 The 4 stages of an R-type instruction [Hennessy & Patterson]
E.g. ADD:
cycle 1: Ifetch | cycle 2: RF read | cycle 3: ALU | cycle 4: RF write

23 Pipelining lw and R-type instructions [Hennessy & Patterson]
[Figure: lw (Ifetch, RF read, ALU, dmem, RF write) followed by add (Ifetch, RF read, ALU, RF write), shown over cycles 1 to 7.]
Resource conflict on the write port of the Rfile

24 Solution: stretch R-type to 5 stages [Hennessy & Patterson]
[Figure: lw and add instructions, all with the five stages Ifetch, RF read, ALU, dmem, RF write, shown over cycles 1 to 7; for the R-type instruction the dmem stage is a dummy op (noop).]
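
The write-port conflict and its fix can be reproduced with a few lines that record in which cycle each instruction reaches RF write. This is a sketch that assumes one instruction is fetched per cycle and that nothing stalls; the names are made up for this example.

```python
LW = ["Ifetch", "RF read", "ALU", "dmem", "RF write"]
ADD_4 = ["Ifetch", "RF read", "ALU", "RF write"]          # 4-stage R-type
ADD_5 = ["Ifetch", "RF read", "ALU", "dmem", "RF write"]  # dmem is a noop


def write_port_conflicts(program):
    """Return the cycles in which more than one instruction is in RF write.

    program is a list of stage lists; instruction i is fetched in cycle i + 1.
    """
    writers = {}
    for i, stages in enumerate(program):
        cycle = i + 1 + stages.index("RF write")
        writers[cycle] = writers.get(cycle, 0) + 1
    return sorted(cycle for cycle, count in writers.items() if count > 1)
```

With `[LW, ADD_4]` both instructions want the write port in cycle 5; with `[LW, ADD_5]` the add writes one cycle later and the conflict disappears.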

25 [Figure: pipelined datapath [Hennessy & Patterson] with the stages Ifetch, Reg/dec, exec, mem, wr. It shows the next-PC logic (+ 4), program memory (adr), register file (Ra, Rb, Rw, Di; fields Rs, Rt, Rd), BusA and BusB, the extender for Imm16, the ALU with flags, the data memory (Din, Dout), and the control signals RegDst, ALUSrc, ExtOp, ALUop, branch, MemWr, MemtoReg, RegWr.]

26 Data dependencies: R-type instructions [Hennessy & Patterson]
[Figure: five overlapping instructions, each passing through IM, RF, DM, RF; one instruction writes a register (R1 =...) and a following instruction reads a register (… = R).]

27 Data dependencies: R-type instructions [Hennessy & Patterson]
[Figure: five overlapping instructions, each passing through IM, RF, DM, RF; one instruction writes a register (R1 =...) and a following instruction reads a register (… = R).]
Solution: bypasses

28 Bypasses [Hennessy & Patterson]
[Figure: pipelined datapath with bypass paths; the data memory (adr) is labelled.]

29 Data dependencies: load instruction [Hennessy & Patterson]
[Figure: four overlapping instructions, each passing through IM, RF, DM, RF; a load writes a register (R1 = lw...) and a following instruction reads a register (… = R).]

30 Data dependencies: load instruction [Hennessy & Patterson]
[Figure: four overlapping instructions, each passing through IM, RF, DM, RF; a load writes a register (R1 = lw...) and two following instructions read a register (… = R).]
Bypass is no solution for the instruction that directly follows the load

31 Data dependencies: load instruction [Hennessy & Patterson]
[Figure: four overlapping instructions, each passing through IM, RF, DM, RF; a load writes a register (R1 = lw...) and two following instructions read a register (… = R).]
Solution: pipeline interlock = detects a data hazard and stalls the pipeline until the hazard is cleared

32 [Figure: i1 passes through the stages I, R, A, M, W; i2 passes through I, R (interlocked), A, M, W. A marker shows when the data is available from the data cache.]
Instructions:
i1) lw r10, r2, r0
i2) add r8, r9, r10

33 [Figure: i1 passes through the stages I, R, A, M, W; i2 passes through I, R (interlocked), A, M, W.]
Instructions:
i1) MULT r3, r2, r1
i2) ADD r5, r4, r3
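
The interlock decision in these two examples can be sketched as a check on adjacent instructions. The model below is a simplification with loudly stated assumptions: it treats `lw` and `mult` as the only long-latency producers, charges exactly one stall cycle, assumes all other dependencies are covered by bypasses, and uses a hypothetical tuple format `(op, dest, src1, src2)` with the destination first, as in the slide examples.

```python
LONG_LATENCY = {"lw", "mult"}  # assumption: results not available via bypass in time


def count_stalls(program):
    """Count interlock stalls in a list of (op, dest, src1, src2) tuples.

    A stall is charged when an instruction reads the destination register of
    the directly preceding long-latency instruction.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest = prev[0].lower(), prev[1]
        if prev_op in LONG_LATENCY and prev_dest in cur[2:]:
            stalls += 1
    return stalls
```

A compiler can often avoid the stall by moving an independent instruction between the producer and the consumer; in this model that simply makes the pair non-adjacent.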

34 Control hazards [Hennessy & Patterson]
[Figure: pipelined datapath (next PC with + 4 and branch, program memory, register file with Ra, Rb, Rw, Di and fields Rs, Rt, Rd, BusA, BusB, extender for Imm16, ALU flags, data memory with adr, Din, Dout) above five overlapping instructions in the stages I, R, A, M, W.]

35 Control hazards [Hennessy & Patterson]
[Figure: the same pipelined datapath above three overlapping instructions in the stages I, R, A, M, W; the branch condition is marked "0?".]

36 Control hazards
[Figure: i1, i2, and i3 overlapping in the stages I, R, A, M, W; a marker shows when the address is available for instruction fetch.]
i1) beq r10, r2, 1b
i2) nop/independent instructions
i3) add r8, r9, r10
Solution: compiler action, possibly filling the branch delay slot

37 PR3930 CPU
[Block diagram: PR3930 IU (including MAD), 8K I$ with Itag, 4K D$ with dtag, MMU, DSU, PIO.]

38 TCP chip: TV controller
[Block diagram: PR core with I$ and D$.]
peripherals: Gfx, SDRAM controller, serial interconnect bus, I2C, UART, timers
PI bus architecture
80 mm2, 352 pins
0.35 micron process
48 MHz (96 for gfx)

39 Programmable CPU cores
introduction
architecture of the MIPS core, discussed as an example
pipelining
application examples
software issues
comparison between different CPU cores
towards application specific architectures
discussion

40 Application examples (1)
[Figure: FIR filter. The samples x0 to x4 along a delay line (Z^-1 elements) are multiplied (*) by the coefficients c0 to c4 and summed (+) into the output y.]

41 Application examples (1)
19 instructions per tap!!
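
For reference, the filter in the figure computes one multiply-accumulate per tap. The sketch below is a plain direct-form FIR in Python, assuming input samples before the start of the stream are zero; it only illustrates the arithmetic and says nothing about how the 19 instructions per tap on the MIPS come about.

```python
def fir(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * x[n - k].

    Samples before the start of the input are taken to be 0.
    """
    delay_line = [0] * len(coeffs)
    output = []
    for x in samples:
        # Shift the delay line by one position (the Z^-1 elements).
        delay_line = [x] + delay_line[:-1]
        # One multiply and one add per tap.
        output.append(sum(c * d for c, d in zip(coeffs, delay_line)))
    return output
```

Feeding a unit impulse returns the coefficients themselves, which is a quick way to check an assembly implementation against a reference.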

42 Application examples (2)
Bit level operations: finite field arithmetic
10 instructions!!
Very simple in hardware

43 Application examples (2)
Bit level operations: DES example
srl $13, $2, 20
andi $25, $13, 1
srl $14, $2, 21
andi $24, $14, 6
or $15, $25, $24
srl $13, $2, 22
andi $14, $13, 56
or $25, $15, $14
sll $24, $25, ...
[Figure: bits of the source register ($2) routed to the destination register ($24).]
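
Read as plain shifts and masks, the first eight instructions of this sequence gather scattered bits of the source register into six adjacent bits. The sketch below mirrors them one by one; the final `sll` is left out because its shift amount is not given in the transcript, and the function name is made up for this example.

```python
def gather_bits(r2):
    """Mirror the srl/andi/or sequence; returns the value of $25 before the sll."""
    r13 = r2 >> 20         # srl  $13, $2, 20
    r25 = r13 & 1          # andi $25, $13, 1   -> source bit 20 to bit 0
    r14 = r2 >> 21         # srl  $14, $2, 21
    r24 = r14 & 6          # andi $24, $14, 6   -> source bits 22..23 to bits 1..2
    r15 = r25 | r24        # or   $15, $25, $24
    r13 = r2 >> 22         # srl  $13, $2, 22
    r14 = r13 & 56         # andi $14, $13, 56  -> source bits 25..27 to bits 3..5
    r25 = r15 | r14        # or   $25, $15, $14
    return r25
```

In hardware the same permutation is just wiring, which is the point the slide makes about bit-level operations on a word-oriented CPU.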

44 Application examples (2)
Bit level operations: A5 example (GSM encryption)
srl $24, $5, 18
srl $25, $5, 17
xor $8, $24, $25
srl $9, $5, 16
xor $10, $8, $9
srl $11, $5, 13
xor $12, $10, $11
andi $13, $12, 1
[Figure: shift register $5 whose tapped bits are combined by xor into $13.]
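
The shift-and-xor sequence computes the feedback bit of a linear feedback shift register from the bits at positions 18, 17, 16, and 13. A minimal sketch follows; `feedback_bit` mirrors the instructions under the assumption that the final mask is 1, while `clock` (shift left, insert the feedback at bit 0, 19-bit register) is an assumption about how the bit is used and is not taken from the slide.

```python
TAPS = (18, 17, 16, 13)  # shift amounts of the srl instructions


def feedback_bit(reg):
    """XOR of the tapped bits, as computed by the srl/xor/andi sequence."""
    acc = 0
    for tap in TAPS:
        acc ^= reg >> tap
    return acc & 1


def clock(reg, width=19):
    """Hypothetical register update: shift left and insert the feedback bit."""
    mask = (1 << width) - 1
    return ((reg << 1) | feedback_bit(reg)) & mask
```

Seven MIPS instructions are spent on what is a handful of XOR gates in hardware.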

45 Application examples (3)
CIF format = 352 * 288 px, 2:1:1, 8 bits/sample
QCIF = 1/4 CIF
SQCIF = 96 * 128
Process = 0.25 micron
power consumption = Hz
Video conferencing (H263): 96 * 128 * 1.5 * 10 Hz = 180 KB/s; compressed by a factor of 72 this gives 20 Kb/s
Compare: 852 * 576 * 2 B/px * 50 = 49 MB/s
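
The data rates on this slide follow from simple multiplication, which the sketch below reproduces. The helper names are invented here, and the unit convention (K = 1024 for the SQCIF figures) is an assumption chosen because it makes the slide's 180 KB/s and factor 72 come out exactly.

```python
def raw_rate_bytes(width, height, bytes_per_pixel, frames_per_second):
    """Uncompressed video data rate in bytes per second."""
    return width * height * bytes_per_pixel * frames_per_second


def compression_ratio(raw_bytes_per_second, channel_bits_per_second):
    """How much the raw stream must shrink to fit the channel."""
    return raw_bytes_per_second * 8 / channel_bits_per_second


# SQCIF at 10 frames/s with 1.5 bytes per pixel (8-bit samples, 2:1:1).
sqcif_rate = raw_rate_bytes(96, 128, 1.5, 10)            # 184320 bytes/s
sqcif_ratio = compression_ratio(sqcif_rate, 20 * 1024)   # to fit 20 Kb/s

# The comparison case on the slide: 852 * 576 at 2 bytes per pixel, 50 per second.
tv_rate = raw_rate_bytes(852, 576, 2, 50)                # 49075200 bytes/s
```

`sqcif_rate / 1024` is 180 and `sqcif_ratio` is 72, matching the slide; `tv_rate` is about 49 million bytes per second.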

46 Application examples (3)
[Block diagram: H.263 video encoder. Blocks: DCT, Q, VLC, IQ, IDCT, frame store, motion estimation, motion compensation. Signals: in, out, best match, motion vectors; a subtractor (-) and an adder (+) combine the input and the reconstructed data with the best match.]

47 Application examples (3)
[Figure: PR3940 with I$, D$, and memory.]
10 Hz => 140 MHz CPU

48 Application examples (3)
In which process can the H263 video encoder be executed on a single MIPS processor?
Conclusion: power consumption is the limiting factor!!

49 Application examples: conclusions
CPUs offer flexibility, but…
–not efficient in performance
–not efficient in code size
–not efficient in power consumption

50 Source code:
func() { a=x.value & 0x3; if (a != 0) { b = a * c + d; } else { b = … ; } y.post(b); }
parser: splits the code into the basic blocks BB1 to BB4: a=x.value & 0x3; followed by b = a * c + d; (a != 0) or b = … ; (a == 0), followed by y.post(b);
compile each BB to instructions, e.g.: ldi #0x3, R5; and R4,R5,R6; cmp R0,R6,R7; br R7,true; ba false
Arch. Model: ldi = 2 cycles, nop = 1 cycle, ...
generate new C with delay counts:
func() { a=x.value & 0x3; DelayCycles(7); if (a != 0) { b = a * c + d; DelayCycles(8); } else { b = … ; DelayCycles(5); } y.post(b); DelayCycles(4); }
compile and run
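
The delay count of a basic block is the sum of the cycle costs of its instructions under the architecture model. The sketch below shows that summation; only the costs of `ldi` and `nop` come from the slide, while the costs of `and`, `cmp`, `br`, and `ba` are hypothetical values picked so that the first block adds up to the 7 cycles shown, and the helper names are invented for this example.

```python
CYCLES = {
    "ldi": 2, "nop": 1,                    # given in the architecture model
    "and": 1, "cmp": 1, "br": 2, "ba": 1,  # hypothetical costs for illustration
}


def block_delay(opcodes):
    """Total cycle count of one basic block."""
    return sum(CYCLES[op] for op in opcodes)


def annotate(statement, opcodes):
    """Append the delay call to the source statement of a basic block."""
    return f"{statement} DelayCycles({block_delay(opcodes)});"
```

For the first block, `annotate("a=x.value & 0x3;", ["ldi", "and", "cmp", "br", "ba"])` yields the annotated statement with a delay of 7. The annotated C then runs natively at full speed while still accumulating target cycle counts.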

51 Comparison between different CPU cores

52 Comparison between different CPU cores

53 Comparison between different CPU cores

54 Comparison between different CPU cores

55 Comparison between different CPU cores

56 Power consumption in microprocessors
Power consumption is (becoming) the limiting factor in processor design.
Solution in the direction of:
–hardware acceleration
–instruction level parallelism instead of clock speed
–code size efficiency
source: ISSCC 2001, Patrick Gelsinger, Intel

57 Towards application specific architectures
ConCISe [Bernardo Kastrup]
[Figure: register file (ADR1, ADR2, ADW, RW, RR1, RR2) addressed from the encoded instruction word; its outputs OP1 and OP2 feed the ALU (OP) and the CFU, which also receives DEC; a MUX selects the result. Bus widths 32 and 4 are marked.]

58 Towards application specific architectures
[Figure: CFU with operand register 1, operand register 2, and DEC as inputs and the result register as output.]
Example equation for one output bit (12) is shown:
OUTb12 = IMP1b4 & !IMP2b3 & !dec0 # !IMP1b5 & IMP2b4 & dec3

59 Towards application specific architectures
Original assembly code:
lw $15, 68($sp)
srl $24, $15, 16
and $14, $24, 255
sll $25, $14, 24
lw $10, 64($sp)
and $9, $10, 255
or $11, $25, $9
srl $24, $10, 16
and $14, $24, 255
sll $25, $14, 8
or $9, $11, $25
and $24, $15, 255
sll $14, $24, 16
or $11, $9, $14
sw $11, 44($sp)
[Figure: data-flow representation of the lw, srl, and, sll, or, and sw operations, next to the version with a custom instruction, in which only lw, lw, the CFU operation (Idec), and sw remain.]
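
To make the data flow concrete, the sketch below mirrors the twelve instructions between the two loads and the store, and then gives the single expression a custom instruction would have to compute. The function names are invented for this example; `a` stands for the word loaded from 68($sp) and `b` for the word loaded from 64($sp).

```python
def pack_step_by_step(a, b):
    """Mirror the original instruction sequence register by register."""
    r24 = a >> 16                      # srl $24, $15, 16
    r14 = r24 & 255                    # and $14, $24, 255
    r25 = (r14 << 24) & 0xFFFFFFFF     # sll $25, $14, 24
    r9 = b & 255                       # and $9, $10, 255
    r11 = r25 | r9                     # or  $11, $25, $9
    r24 = b >> 16                      # srl $24, $10, 16
    r14 = r24 & 255                    # and $14, $24, 255
    r25 = r14 << 8                     # sll $25, $14, 8
    r9 = r11 | r25                     # or  $9, $11, $25
    r24 = a & 255                      # and $24, $15, 255
    r14 = r24 << 16                    # sll $14, $24, 16
    r11 = r9 | r14                     # or  $11, $9, $14
    return r11                         # value stored by sw $11, 44($sp)


def pack_custom(a, b):
    """The same result as one operation: bytes 2 and 0 of a, then bytes 2 and 0 of b."""
    return ((((a >> 16) & 255) << 24) | ((a & 255) << 16)
            | (((b >> 16) & 255) << 8) | (b & 255))
```

Both functions agree for any pair of 32-bit inputs; the custom instruction replaces twelve ALU operations with one.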

60 Towards application specific architectures
ConCISe integrated tool-set
[Flow diagram with the following elements: source code, core compiler, assembly code, MIPS-ConCISe simulator/profiler, profile data, hardware/software partitioning, hardware partition, translator, HDL file, hardware compiler, hardware netlist, "Does it fit? Y/N", modified assembly with ASIs, assembler/linker, executable, simulator, benchmark results.]

61 Towards application specific architectures
ConCISe [Bernardo Kastrup]
Advantages: faster execution, smaller code size, lower power
The Configurable Functional Unit (CFU) can be:
–Standard cell
–Field-Programmable Logic (FPL)
Considerably bigger in silicon (4 to 5 mm2 in C075)
But it’s reconfigurable = reprogrammable for different application programs

62 Some benchmarks

63 Amdahl’s law
Impact of an improvement on the execution time of a program depends on 2 parameters:
–f = fraction of the original computation time that is affected by the improvement
–s = speedup factor (local)
exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)
if s >> 1 then speedup_overall ≈ 1 / (1 - f)
Example: 40 % of the program can be executed 10 x faster
speedup_overall = 1 / (0.6 + 0.4 / 10) = 1.56
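
The formula is easy to evaluate for other values of f and s. A minimal sketch, with function names chosen for this example:

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the run time is sped up by a factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


def amdahl_limit(f):
    """Upper bound on the overall speedup as s grows without bound (f < 1)."""
    return 1.0 / (1.0 - f)
```

For the slide's example, `amdahl_speedup(0.4, 10)` gives about 1.56, while `amdahl_limit(0.4)` shows that even an infinitely fast accelerator cannot push the overall speedup beyond about 1.67.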

64 Towards application specific architectures

65 Conclusions
Programmable CPU cores are important for the control parts of the application.
They are well supported with tools for the development of end-user software (vs. deeply embedded sw).
Keep it Simple heuristic (RISC vs. CISC):
–Make frequent cases fast and rare cases correct.
–Regular (orthogonal) instruction set.
–No special features that match a high level language construct.
–At least 16 registers to ease register allocation.
Embedded cores are often light cores, which are a compromise between performance, area and power dissipation (vs. stand-alone CPU cores, which are optimised for performance).

66 Hands-on
Implement a FIR filter in assembly and simulate

67 Hands-on
SPIM (MIPS assembly simulator): link from PAM website
Use appendix A (same site)
Example assembly file on PAM website
1 or 2 page report in 2 weeks:
–Engineering decisions (e.g. addressing of samples)
–Verify that C-code and assembly match
–Assembly in appendix
–# instructions/tap? Conclusions?

