Presentation is loading. Please wait.

Presentation is loading. Please wait.

Computer Organization & Design 计算机组成与设计

Similar presentations


Presentation on theme: "Computer Organization & Design 计算机组成与设计"— Presentation transcript:

1 Computer Organization & Design 计算机组成与设计
Weidong Wang (王维东) College of Information Science & Electronic Engineering 信息与通信网络工程研究所(ICAN) Zhejiang University

2 Course Information Instructor: Weidong WANG TA:
Tel(O): ; Office Hours: TBD, Yuquan Campus, Xindian (High-Tech) Building 306 Mobile: TA: mobile, 陈 彬彬 Binbin CHEN, ; 陈 佳云 Jiayun CHEN, ; Office Hours: Wednesday & Saturday 14:00-16:30 PM. Xindian (High-Tech) Building 308.(也可以短信邮件联系) 微信号-“计组”

3 Lecture 4 Performance, Energy & Compiler

4 Computer Performance Metrics
Response Time (latency延迟 ) How long does it take for my job to run? How long does it take to execute a job? How long must I wait for the database query? Throughput吞吐量 How many jobs can the machine run at once? What is the average execution rate? How many queries per minute? If we upgrade a machine with a new processor what to we increase? If we add a new machine to the lab what do we increase?

5 Performance性能 = Execution Time
Elapsed消逝 Time Counts everything (disk and memory accesses, I/O, etc.) A useful number, but often not good for comparison purposes E.g., OS & multiprogramming time make it difficult to compare CPUs CPU time (CPU = Central Processing Unit = processor) Doesn’t count I/O or time spent running other programs Can be broken up into分成 system time, and user time Our focus: user CPU time Time spent executing the lines of code that are “in” our program Includes arithmetic, memory, and control instructions, …

6 Clock Cycles Instead of reporting execution time in seconds, we often use cycles周期数 Clock “ticks” indicate when to start activities Cycle time = time between ticks = seconds per cycle Clock rate (frequency) = cycles per second (1Hz. = 1cycle/sec) A 2GHz clock has a 500 picoseconds (ps) cycle time.

7 Performance and Execution Time
The program should be something real people care about Desktop: MS office, edit, compile Server: web, e-commerce, database Scientific: physics, weather forecasting 程序(Program)是为实现特定目标或解决特定问题而用计算机语言编写的命令序列的集合。

8 Measuring Clock Cycles
Clock cycles/program is not an intuitive直观的 or easily determined value, so Clock Cycles = Instructions x Clock Cycles Per Instruction Cycles Per Instruction (CPI) used often 消费物价指数 ?(Consumer Price Index 物价指数) CPI is an average since the number of cycles per instruction varies from instruction to instruction Average depends on instruction mix, latency of each inst. type etc. CPIs can be used to compare two implementations of the same ISA, but is not useful alone for comparing different ISAs An X86 add is different from a MIPS add

9 Using CPI Drawing on the previous equation:
To improve performance (i.e., reduce execution time) Increase clock rate (decrease clock cycle time) OR Decrease CPI OR Reduce the number of instructions Designers balance cycle time against the number of cycles required Improving one factor may make the other one worse

10 Clock Rate时钟频率 ≠ Performance
Mobile Intel Pentium Vs Intel Pentium M 2.4GHz GHz P4 is 50% faster? Performance on Mobilemark with same memory and disk Word, excel, photoshop, powerpoint, etc. Mobile Pentium 4 is only 15% faster What is the relative CPI? ExecTime = ICCPI/ClockRate IC  CPIM/1.6 = 1.15  IC  CPI4/2.4 CPI4/CPIM = 2.4/(1.15  1.6)=1.3

11 CPI Calculation Different instruction types require different numbers of cycles CPI is often reported for types of instructions where CPIi is the CPI for the type of instructions and ICi is the count of that type of instruction

12 CPI Calculation To compute the overall average CPI use CPI

13 Computing CPI Example 频度 Given this machine, the CPI is the sum of CPIFrequency Average CPI is = 1.5 What fraction of the time for data transfer?

14 What is the Impact of Displacement Based Memory Addressing Mode?
Assume 50% of MIPS loads and stores have a zero displacement.

15 Speedup加速比 Speedup allows us to compare different CPUs or optimizations Example Original CPU takes 2sec to run a program New CPU takes 1.5sec to run a program Speedup = or speedup 33%

16 Amdahl’s Law阿姆德尔定律IBM/1967
If an optimization improves a fraction f of execution time by a factor of a This formula is known as Amdahl’s Law Lessons from If f->100%, then speedup = a If a->∞, the speedup = 1/(1-f) Summary Make the common case fast Watch out for the non-optimized component

17 CPI as an analytical tool to guide design
Machine CPI(throughput, not latency) Program Instruction Mix 5 x x x x x 20 100 = 2.7 cycles/instruction Where program spends its time 20/270

18 Amdahl’s Law (of Diminishing递减 Returns)
Where program spends its time If enhancement “E” makes multiply infinitely fast, but other instructions are unchanged, what is the maximum speedup “S”? S = 1 (post-enhancement %) / 100% 1 48%/100% = = 2.08 Attributed to Gene Amdahl -- “Amdahl’s Law” What is the lesson of Amdahl’s Law? Must enhance computers in a balanced way!

19 Evaluating Performance
Performance best determined by running a real application Use programs typical of expected workload e.g. compiler/editors, scientific applications, graphics, etc. Microbenchmarks微基准 Small programs, synthetic or kernels from larger applications Nice for architects and designers Can be misleading误导 Benchmarks基准 Collection of real programs that companies have agreed on Components: programs, inputs & outputs, measurements rules, metrics Can still be abused滥用

20 Benchmarks基准指标 System Performance Evaluation Cooperative (SPEC)
Scientific computing: Linpack, SpecOMP, SpecHPC,… Embedded benchmarks: EEMBC, Dhrystone, … Enterprise企业级 computing TPC-C, TPC-W, TPC-H SpecJbb, SpecSFS, SpecMail, Streams, Multiprocessor: PARSEC, SPLASH-2, EEMBC (multicore) Other 3Dmark, ScienceMark, Winstone, iBench, AquaMark, … Watch out注意: your results will be as good as your benchmarks Make sure you know what the benchmark is designed to measure Performance is not the only metric度量 for computing systems Cost, power consumption, reliability, real-time performance, … 把在VAX-11/780机器上的测试结果1757 Dhrystones/s定义为1 Dhrystone MIPS

21 Summarizing Performance
Combining results from multiple programs into 1 benchmark score Sometimes misleading, always controversial有争议, … and inevitable不可避免 We all like quoting a single number 3 types of means平均 Arithmetic算术: for times Harmonic调和的 : for rates Geometric几何的: for rations

22 Using the Means Arithmetic mean: Harmonic mean:调和平均数 Geometric mean:
When you have individual performance scores in latency 等待时间 Harmonic mean:调和平均数 When you have individual performance scores in throughput 吞吐量/能力 Geometric mean: Nice property: GM(X/Y)= GM(X)/GM(Y) But difficult to related back to execution times Note Always look at the results for individual programs

23 Performance Summary Performance is specific to a particular特定 programs
Total execution time is a consistent一致 summary of performance For a given architecture performance increases come from: Increase in clock rate (without adverse CPI effects) Improvements in processor organization that lower CPI Compiler enhancements that lower CPI and/or instruction count Algorithm/Language choices that affect instruction count Pitfall陷阱: expecting improvement in one aspect of a machine’s performance to affect the total performance

24 Break Energy

25 Switching Energy: Fundamental Physics
Every logic transition转换 dissipates消耗 energy. Vdd C 1 2 C Vdd E1->0= E0->1= Strong result: Independent of technology. How can we limit switching energy? (1) Slow down clock (fewer transitions). But we like speed ... (2) Reduce Vdd. But lowering Vdd lowers the clock speed ... (3) Fewer circuits. But more transistors can do more work. (4) Reduce C per node. One reason why we scale processes.

26 Scaling switching energy per gate ...
Recall process scaling Due to reducing V and C (length and width of Cs decrease, but plate distance gets smaller). Recent slope more shallow弱的 because V is being scaled less aggressively.激烈 From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.

27 Second Factor: Leakage Currents
Even when a logic gate isn’t switching, it burns power. 0V = Isub: Even when this nFet is off, it passes an Ioff leakage current. We can engineer any Ioff we like, but a lower Ioff also results in a lower Ion, and thus the lower the clock speed. Igate: Ideal capacitors have zero DC current. But modern transistor gates are a few atoms thick, and are not ideal. Intel’s current processor designs, leakage vs switching power A lot of work was done to get a ratio this good /50 is common. Bill Holt, Intel, Hot Chips 17.

28 Engineering “On” Current at 25 nm ...
We can increase Ion by raising Vdd and/or lowering Vt. Vd I ds V g I ds Vs 1.2 mA = Ion 0.25 ≈ Vt Ioff = 0 ??? 0.7 = Vdd

29 Plot on a “Log” Scale to See “Off” Current
We can decrease Ioff by raising Vt - but that lowers Ion. Vd I ds V g I ds 1.2 mA = Ion Vs 0.25 ≈ Vt Ioff ≈ 10 nA 0.7 = Vdd

30 Device engineers trade speed and power
We can reduce CV2 (Pactive) by lowering Vdd. We can increase speed by raising Vdd and lowering Vt. We can reduce leakage (Pstandby) by raising Vt. From: Silicon Device Scaling to the Sub-10-nm Regime Meikei Ieong,1* Bruce Doris,2 Jakub Kedzierski,1 Ken Rim,1 Min Yang1

31 Customize processes for product types ...
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.

32 Intel: Comparing 2 CPU generations ...
Clock speed unchanged ... Lower Vdd, lower C, but more leakage. Design tricks: architecture & circuits. Find enough tricks, and you can afford to raise Vdd a little so that you can raise the clock speed!

33 Gate delay roughly linear with Vdd
And so, we can transform this: Block processes stereo audio. 1/2 of clocks for “left”, 1/2 for “right”. Gate delay roughly linear with Vdd Into this: CV2 power only Ex: Top block processes audio channel 1, bottom block processes audio channel 2. This magic trick brought to you by Cory Hall ...

34 Trade hardware for power, on a large scale ...
Multiple Cores for Low Power Trade hardware for power, on a large scale ...

35 Cell: The PS3 chip

36 Cell: Conventional CPU + 8 “SPUs”
L2 Cache 512 KB PowerPC 8 Synergistic Processing Units (SPUs)

37 One Synergistic协同 Processing Unit (SPU)
256 KB Local Store bit Registers SPU issues 2 inst/cycle (in order) to 7 execution units SPU fills Local Store using DMA to DRAM and network

38 A “Schmoo” plot for a Cell SPU ...
1 2 C Vdd E1->0= E0->1= The lower Vdd, the less dynamic energy consumption. The lower Vdd, the longer the maximum clock period, the slower the clock frequency.

39 Clock speed alone doesn’t help E/op ...
But, lowering clock frequency while keeping voltage constant spreads the same amount of work over a longer time, so chip stays cooler ... 1 2 1 2 E0->1= C Vdd E1->0= C Vdd 2 2

40 Scaling V and f does lower energy/op
1 W to get 2.2 GHz performance. 26 C die temp. 7W to reliably get 4.4 GHz performance. 47C die temp. If a program that needs a 4.4 Ghz CPU can be recoded to use two 2.2 Ghz CPUs ... big win.

41 Intel’s dual-core analysis ...
But only if your app(s) can put 2 cores to use! From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.

42 How iPod puts its 2 cores to use ...
Two 80 MHz CPUs. This chip is used in the most iPods now, with one CPU doing audio decoding, the other doing photos, etc.

43 Other Low-Power Techniques

44 Add “sleep” transistors to logic ...
Example: Floating point unit logic. When running fixed-point instructions, put logic “to sleep”. +++ When “asleep”, leakage power is dramatically reduced. --- Presence of sleep transistors slows down the clock rate when the logic block is in use.

45 Intel example: Sleeping cache blocks
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.

46 Recall: Most logic on a chip “too fast”
Most wires have hundreds of picoseconds to spare. The critical path From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

47 Use several supply voltages on a chip ...
Why? We could use the low-power Vdd for logic well off the critical path. From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.

48 On a CPU, where does the heat go?
Half of the power goes to latches (Flip-Flops). Most of the time, the latches don’t change state. So (gasp) gated clocks are a big win. But, done with CAD tools in a disciplined way. From: Bose, Martonosi, Brooks: Sigmetrics-2001 Tutorial

49 IBM Power 4: How does die heat up?
2 CPUs per die 4 dies on a multi-chip module

50 115 Watts: Concentrated in “hot spots”
Fixed point units Cache logic 66.8 ℃ == 152 F ℃ == F

51 Break Compiler

52 Translation Hierarchy
High-levelAssemblyMachine

53 Compiler编译器 Converts high-level language to machine code
词汇 解析器 句法 语义

54 Intermediate Representation (IR)
中间表示 Three-address code (TAC) t1= 2 * I t2 = t1 + 1 j = t2 t3 = j < n If t3 goto L0 t4 = 2 * I t5 = t4 + 3 j = t5 L0: t6 = a[j] return t6 j= 2 * I + 1; If (j >= n) j = 2*I + 3; return a[j]; Single assignment

55 Optimizing Compilers Provide efficient mapping of program to machine
Code selection and ordering Eliminating minor较小的 inefficiencies低效 Register allocation分配 Don’t (usually) improve asymptotic渐近 efficiency Up to programmer to select best overall algorithm Big-O savings are (often) more important than constant factors But constant factors also matter Optimization types Local: inside basic blocks Global: across basic blocks e.g. loops

56 Limitations of Optimizing Compilers
Operate Under Fundamental Constraints约束条件 Improved code has same output Improved code has same side effects Improved code is as correct as original code Most analysis is performed only within procedures程序 Whole-program analysis is too expensive in most cases Most analysis is based only on static information Compiler has difficulty anticipating预见 run-time inputs Profile driven dynamic compilation becoming more prevalent Dynamic compilation (e.g. Java JIT) When in doubt, the compiler must be conservative谨慎保守的

57 Preview: Code Optimization
Goal: Improve performance by: Removing redundant work Unreachable code Common-subexpression elimination Induction variable归纳变量 elimination Using simpler and faster operations Dealing with constants in the compiler Strength reduction强度折减reformulating less expensive ones Managing register well 后面一一解析 ExecutionTime = Instructions x CPI x ClockCycleTime

58 Constant Folding折叠 Detect & combine values that will be constant.
Respect laws of arithmetic on target machine. a = 10 * – b;

59 Constant Propagation传播
When x gets assigned a constant c, replace users of x with uses of c, as long as x remains unchanged. Very useful for turning constants into immediates, for code generator.

60 Constant Propagation

61 Reduction in Strength Some operations are faster (lower CPI) than others: Multiplication slower than addition Mult or Divide by 2n slower than shift left or right Multiplies take 12 cycle on MIPS R2000 ALU instruction take 1 cycle Can replace complex instruction with equivalent simple instructions Correctness:正确性 Overflow behavior preserved? Other side-effects of operation? (interrupts)

62 Reduction in Strength How to replace multiplication with addition?
Often presents in for loops & array walks. while (i<100) arr[i++] = 0;

63 Common Sub-Expression Elimination

64 Common Sub-Expression Elimination

65 Induction Variable归纳变量 Elimination
while (i<100) arr [i++] = 0; Remember replacing multiply with add in loop? Why not get rid of induction variable I altogether? Induction variable is a variable that is changed by a constant amount on every iteration of a loop.

66 Register Allocation & Assignment
Register provide fastest data access in the processor. Two terms: What variables belong in register? (allocation分配) What register will hold this value? (assignment委派) Once a value is in a register, we’d like not to spill洒出& restore it to use same value again.

67 Assembly Code Generation

68 Compiler Performance on Bubble Sort 冒泡排序

69 Assembler汇编器 Expands macros and pseudo instructions as well as converts constants. Primary purpose is to produce an object目标 file Machine language instructions Application data Information for memory organization

70 Object File目标文件

71 Linker连接器 Linker combines multiple object modules Steps
Identify确定 where code/data will be placed in memory Resolve code/data cross references Produces executable if all references found Steps Place code and data modules in memory Determine the address of data and instruction labels Patch修补 both the internal and external references Separation between compiler and linker makes standard libraries and efficient solution to maintaining modular模块化的 code.

72 Loader装载器 Loader used at run-time
Reads executable file header for size of text/data segments Create address space sufficiently large Copy program from executable on disk into memory Copy arguments参数 to main program’s stack Initialize machine registers and set stack pointer Jump to start-up routine程序 Terminate program when execution completes

73 SPIM System Calls SPIM provides a small number of OS “system calls”
Set system call code in $v0 and pass arguments as needed See web site for details and examples. Service Code Argument Result print_int 1 $a0 = integer print_float 2 $f12 = float print_double 3 $f12 = double print_string 4 $a0 = string read_int 5 Integer in $v0 read_float 6 Float if $f0 read_double 7 Double in $f0 read_string 8 $a0 = buffer, $a1 = length Data in buffer Sbrk 9 $a0 = amount Address in $v0 Exit 10

74 MIPS Assembler Directives指令
SPIM supports a subset of the MIPS assembler directives Some of the directives include .asciiz – Store a null-terminated string in memory .data – Start of data segment .global – Identify an exported symbol .text – Start of text segment .word – Store words in memory

75 Procedure程序Call and Return
Procedures are required for structured programming Aka: functions, methods, subroutines, … Implementing procedures in assembly requires several things to be done Memory space must be set aside for local variables Arguments must be passed in and return values passed out Execution must continue after the call Procedure Call Steps Place parameters in a place where the procedure can access them Transfer control to the procedure Acquire the storage resources needed for the procedure Perform the desired task Place the result value in a place where the calling program can access it Return control to the point of origin

76 MIPS Storgae Layout

77 Procedure Activation Record (Frame)
Each procedure creates an activation record on the stack

78 Register Assignments Calling Convention规矩

79 Caller vs. Callee Saved Registers
Preserved No Preserved Saved register ($s0-$s7) Temporary register ($t0-t9) Stack/frame pointer ($sp, $fp, $gp) Argument registers ($a0-$a3) Return address ($ra) Return values ($v0-$v1) Preserved registers (Callee Save) Save register values on stack prior to use Restore registers before return Not preserved registers (Caller Save) Do what you please and expect callees to do likewise Should be saved by the caller if needed after procedure call

80 Call and Return Caller Callee Setup Callee Return
Save caller-saved register $a0-$a3, $t0-$t9, if necessary Load arguments in $a0-$a3, rest passed on stack Execute jal instruction Callee Setup Allocate memory for new frame ($sp = $sp - frame) Save callee-saved registers $s0-$s7, $fp, $ra, if necessary Set frame pointer ($fp = $sp + frame size - 4) Callee Return Place return value in $v0 and $v1 Restore any callee-saved registers Pop stack ($sp = $sp + frame size) Return by jr $ra

81 Calling Convention惯例Steps

82 Simple Example

83 How to solve a complex problem
Specification Break problem into simpler steps Specify goal in executable forms Algorithm Coding in high language (like C) Translate into ISA Compiler Execute instructions as fast as possible Processor

84 Logic Review

85 How To Build A Processor (or any complex hardware)
Break operation down to steps even gates can understand Generally decompose digital systems into two kinds of operation Things that deal with the real data (Datapath) Things that control the stuff operating on the real data (Control) Find a decomposition分解that is simple, and efficient Some are obvious, others can be more subtle精细的 We will start simple Add stuff材料to improve performance

86 How to Execute Instructions in Five Easy Steps
1) First we need to: Fetch提取 the instruction 2) Then we need to: Decode译码 instruction/fetch register 3) Then we need to: Do the operation运算 4) Then we need to: Write the result into register file 5) Finally we need to: Calculate the next instruction address

87 Implementing实现 A MIPS Implementing a simple MIPS:
Single-cycle per instruction datapath & control logic Subset of Instructions(9 instructions) To simplify our study of processor design, we will focus on a subset of the MIPS instructions Memory: lw and sw Arithmetic: addu, subu, and, ori, and slt Branch: beq and j The method of implementing other instructions should come naturally from these

88 Starting The Design Think of the steps needed for each instruction execution Instructions could take multiple cycles or one cycle Steps that occur on each instruction: 1) Fetch instruction from memory – address is specified by PC 2) Read one or two registers 3) Do add/sub/etc. using ALU 4) Fetch a value from memory 5) Store results to register-file/memory Needed state for single cycle machine: Instruction pointer (PC), 32 register, memory

89 Logic Design Basics Information encoded in binary
Low voltage = 0, High voltage = 1 One wire per bit Multi-bit data encoded on multi-wire buses Combinational element组合(逻辑)元件 Operate on data Output is a function of input If you wait long enough you will get the right answer State (sequential) elements Store information Need to separate signals across clock cycles This is usually done with flip-flops (or latches) We will take about flop-based design in this class

90 Combinational Elements

91 Flipflop, Register, Register file, SRAM, DRAM
Hardware Elements Combinational circuits Mux, Decoder, ALU, ... OpSelect - Add, Sub, ... - And, Or, Xor, Not, ... - GT, LT, EQ, Zero, ... Result Comp? A B ALU Sel O A0 A1 An-1 Mux . lg(n) A Decoder . O0 O1 On-1 lg(n) Synchronous state elements Flipflop, Register, Register file, SRAM, DRAM Clk D Q En ff Edge-triggered: Data is sampled at the rising edge CS252 S05

92 Inverters: A simple transistor model
pFET. A switch. “On” if gate is grounded. “1” “0” This model is too simple to be useful ... “1” “0” “1” “0” nFET. A switch. “On” if gate is at Vdd.

93 Transistors as water valves阀门
If electrons are water molecules分子, and a capacitor a bucket水桶... A “on” p-FET fills up the capacitor with charge. “1” “0” Time Water level “0” “1” Time Water level A “on” n-FET empties the bucket. This model is often good enough ...

94 What is the bucket水桶? A gate’s “fan-out”.
“Fan-out扇出”: The number of gate inputs driven by a gate’s output. Driving other gates slows a gate down. Driving wires slows a gate down.

95 Why we call it “fan-out”

96 A closer look at fan-out ...
Driving more gates adds delay. Linear model works for reasonable fan-out

97 Propagation delay graphs ...
1->0

98 Intuition直觉: Critical paths ...
T2 might be the critical (worst-case delay) path. If d going 0-to-1 switches x 0-to-1, delay is T1. If a going 0-to-1 switches x 0-to-1, delay is T2. It would be surprising if T1 > T2. T1 T2 x = g(a, b, c, d, e, f)

99 Why “might”? Wires have delay too ...
Looks benign仁慈的, but ...

100 Clocking Methodology AS’ AS
Combinational logic transforms data during clock cycles Between clock edges Input from state elements, output to state element Longest delay determines clock period AS’ AS Sequential Logic (State) Combinational

101 Sequential Elements Flip-flops (or registers): stores data in a circuit Uses a clock signal to determine when to update the stored value Edge-triggered: update when Clk changes from 0 to 1

102 Sequential Elements Flip-flops with write control
Only updates on clock edge when write control input is 1 Used when stored value is required later

103 Single cycle data paths: Assumptions假定
Processor uses synchronous logic design (a “clock”). f T 1 MHz 1 μs 10 MHz 100 ns 100 MHz 10 ns 1 GHz 1 ns All state elements act like positive edge-triggered flip flops. D Q clk

104 Edge-Triggered D Flip Flops
Q Value of D is sampled on positive clock edge. Q outputs sampled value for rest of cycle. CLK D Q

105 Review: Edge-Triggering in Verilog
Q Value of D is sampled on positive clock edge. Q outputs sampled value for rest of cycle. module ff(D, Q, CLK); input D, CLK; output Q; (CLK) Q <= D; endmodule Module code has two bugs. Where? CLK

106 Review: Edge-Triggered D Flip Flops
Q CLK Value of D is sampled on positive clock edge. Q outputs sampled value for rest of cycle. module ff(D, Q, CLK); input D, CLK; output Q; reg Q; (posedge CLK) Q <= D; endmodule

107 Timing Analysis and Logic Delay
Register: An Array of Flip-Flops Combinational Logic If T > worst-case delay through CL, does this ensure correct operation?

108 Flip Flops have internal delays ...
Q Value of D is sampled on positive clock edge. Q outputs sampled value for rest of cycle. t_setup CLK t_clk-to-Q D Q

109 Flip-Flop delays eat into “time budget”
Combinational Logic ALU “time budget”

110 Clock skew歪斜also eats into “time budget”
时间预算 CLKd CLKd CLKd As T →0, which circuit fails first?

111 Some Flip Flops have “hold” time ...
t_inv What is the intended function of this circuit? t_setup t_hold D must stay stable here CLK D Q D Does flip-flop hold time affect operation of this circuit? Under what conditions? CLK t_clk-to-Q + t_inv > t_hold For correct operation.

112 Critical Timing Issues
Flops work great as long as input is stable when clock rises Called setup and hold windows Clock skew can cause some nasty令人讨厌的problems Hold time violations违背 Cycle Time = longest Prop Delay + Setup + Clock Skew

113 HomeWork Readings: HW1 Read Chapter 3, 4;

114 Acknowledgements These slides contain material from courses:
UCB CS152. Stanford EE108B


Download ppt "Computer Organization & Design 计算机组成与设计"

Similar presentations


Ads by Google