Performance Evaluation May 2, 2000 Topics Getting accurate measurements Amdahl’s Law 15-213 class29.ppt.

Slides:



Advertisements
Similar presentations
Time Measurement Topics Time scales Interval counting Cycle counters K-best measurement scheme time.ppt CS 105 Tour of Black Holes of Computing.
Advertisements

Code Optimization and Performance Chapter 5 CS 105 Tour of the Black Holes of Computing.
SE-292 High Performance Computing Profiling and Performance R. Govindarajan
CS492B Analysis of Concurrent Programs Memory Hierarchy Jaehyuk Huh Computer Science, KAIST Part of slides are based on CS:App from CMU.
CS1104: Computer Organisation School of Computing National University of Singapore.
CS2100 Computer Organisation Performance (AY2014/2015) Semester 2.
TU/e Processor Design 5Z032 1 Processor Design 5Z032 The role of Performance Henk Corporaal Eindhoven University of Technology 2009.
SE-292 High Performance Computing Profiling and Performance R. Govindarajan
Online Performance Auditing Using Hot Optimizations Without Getting Burned Jeremy Lau (UCSD, IBM) Matthew Arnold (IBM) Michael Hind (IBM) Brad Calder (UCSD)
Computer Organization and Architecture 18 th March, 2008.
Inline Assembly Section 1: Recitation 7. In the early days of computing, most programs were written in assembly code. –Unmanageable because No type checking,
Performance Evaluation I November 5, 1998 Topics Performance measures (metrics) Timing Profiling class22.ppt.
1 Lecture 6 Performance Measurement and Improvement.
Time Measurement October 25, 2001 Topics Time scales Interval counting Cycle counters K-best measurement scheme class18.ppt.
Threads 1 CS502 Spring 2006 Threads CS-502 Spring 2006.
1 Homework Reading –PAL, pp , Machine Projects –Finish mp2warmup Questions? –Start mp2 as soon as possible Labs –Continue labs with your.
Time Measurement CS 201 Gerson Robboy Portland State University Topics Time scales Interval counting Cycle counters K-best measurement scheme.
Chapter 4 Assessing and Understanding Performance
Fall 2001CS 4471 Chapter 2: Performance CS 447 Jason Bakos.
Time Measurement Topics Time scales Interval counting Cycle counters K-best measurement scheme class18.ppt.
Time Measurement December 9, 2004 Topics Time scales Interval counting Cycle counters K-best measurement scheme Related tools & ideas class29.ppt
1 Chapter 4. 2 Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation.
CPU Performance Assessment As-Bahiya Abu-Samra *Moore’s Law *Clock Speed *Instruction Execution Rate - MIPS - MFLOPS *SPEC Speed Metric *Amdahl’s.
Virtual Memory.
Instructor: Erol Sahin Hypertreading, Multi-core architectures, Amdahl’s Law and Review CENG331: Introduction to Computer Systems 14 th Lecture Acknowledgement:
1 Computer Performance: Metrics, Measurement, & Evaluation.
Assembly Questions תרגול 12.
Recap Technology trends Cost/performance Measuring and Reporting Performance What does it mean to say “computer X is faster than computer Y”? E.g. Machine.
Operating Systems Lecture 2 Processes and Threads Adapted from Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard. Zhiqing Liu School of.
Timing and Profiling ECE 454 Computer Systems Programming Topics: Measuring and Profiling Cristiana Amza.
Fall 2012 Chapter 2: x86 Processor Architecture. Irvine, Kip R. Assembly Language for x86 Processors 6/e, Chapter Overview General Concepts IA-32.
1 CS/EE 362 Hardware Fundamentals Lecture 9 (Chapter 2: Hennessy and Patterson) Winter Quarter 1998 Chris Myers.
COP4020 Programming Languages Subroutines and Parameter Passing Prof. Xin Yuan.
Computer Architecture
1 Computer Systems II Introduction to Processes. 2 First Two Major Computer System Evolution Steps Led to the idea of multiprogramming (multiple concurrent.
Lecture 2a: Performance Measurement. Goals of Performance Analysis The goal of performance analysis is to provide quantitative information about the performance.
Performance Performance
TEST 1 – Tuesday March 3 Lectures 1 - 8, Ch 1,2 HW Due Feb 24 –1.4.1 p.60 –1.4.4 p.60 –1.4.6 p.60 –1.5.2 p –1.5.4 p.61 –1.5.5 p.61.
CS 1104 Help Session V Performance Analysis Colin Tan, S
September 10 Performance Read 3.1 through 3.4 for Wednesday Only 3 classes before 1 st Exam!
Computer Organization Instruction Set Architecture (ISA) Instruction Set Architecture (ISA), or simply Architecture, of a computer is the.
Processor Memory Processor-memory bus I/O Device Bus Adapter I/O Device I/O Device Bus Adapter I/O Device I/O Device Expansion bus I/O Bus.
Lecture 5: 9/10/2002CS170 Fall CS170 Computer Organization and Architecture I Ayman Abdel-Hamid Department of Computer Science Old Dominion University.
Lec2.1 Computer Architecture Chapter 2 The Role of Performance.
Performance Evaluation II November 10, 1998 Topics Amdahl’s Law Benchmarking (lying with numbers) class23.ppt.
Timing Tutorial #5 CPSC 261. Clocks Most computer systems have a number of clocks – Give time signals to the pipeline and memory Tick at 2.67 GHz (or.
Time Management.  Time management is concerned with OS facilities and services which measure real time.  These services include:  Keeping track of.
CMSC 611: Advanced Computer Architecture Performance & Benchmarks Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some.
Performance Computer Organization II 1 Computer Science Dept Va Tech January 2009 © McQuain & Ribbens Defining Performance Which airplane has.
Jan. 5, 2000Systems Architecture II1 Machine Organization (CS 570) Lecture 2: Performance Evaluation and Benchmarking * Jeremy R. Johnson Wed. Oct. 4,
1 Lecture 3: Pipelining Basics Today: chapter 1 wrap-up, basic pipelining implementation (Sections C.1 - C.4) Reminders:  Sign up for the class mailing.
Computer Architecture CSE 3322 Web Site crystal.uta.edu/~jpatters/cse3322 Send to Pramod Kumar, with the names and s.
LECTURE 19 Subroutines and Parameter Passing. ABSTRACTION Recall: Abstraction is the process by which we can hide larger or more complex code fragments.
Measuring Time Alan L. Cox Some slides adapted from CMU slides.
June 20, 2001Systems Architecture II1 Systems Architecture II (CS ) Lecture 1: Performance Evaluation and Benchmarking * Jeremy R. Johnson Wed.
Performance. Moore's Law Moore's Law Related Curves.
Chapter Overview General Concepts IA-32 Processor Architecture
Instruction Set Architecture
September 2 Performance Read 3.1 through 3.4 for Tuesday
Homework Reading Machine Projects Labs
William Stallings Computer Organization and Architecture 8th Edition
CS 3114 Many of the following slides are taken with permission from
CS2100 Computer Organisation
Introduction to Computer Systems
Introduction to Computer Systems
System Calls David Ferry CSCI 3500 – Operating Systems
Recitation 6: Cache Access Patterns
CS 3114 Many of the following slides are taken with permission from
CS2100 Computer Organisation
Presentation transcript:

Performance Evaluation May 2, 2000 Topics Getting accurate measurements Amdahl’s Law class29.ppt

CS 213 S’00– 2 – class29.ppt “Time” on a Computer System real (wall clock) time = user time (time executing instructing instructions in the user process) += real (wall clock) time We will use the word “time” to refer to user time. = system time (time executing instructing instructions in kernel on behalf of user process) + = some other user’s time (time executing instructing instructions in different user’s process)

CS 213 S’00– 3 – class29.ppt Time of Day Clock return elapsed time since some reference time (e.g., Jan 1, 1970) example: Unix gettimeofday() command coarse grained (e.g., ~3  sec resolution on Linux, 10 msec resolution on Windows NT) –Lots of overhead making call to OS –Different underlying implementations give different resolutions #include struct timeval tstart, tfinish; double tsecs; gettimeofday(&tstart, NULL); P(); gettimeofday(&tfinish, NULL); tsecs = (tfinish.tv_sec - tstart.tv_sec) + 1e6 * (tfinish.tv_usec - tstart.tv_usec);

CS 213 S’00– 4 – class29.ppt Interval (Count-Down) Timers set timer to some initial value timer counts down toward zero coarse grained (e.g., 10 msec resolution on Linux) void init_etime() { first.it_value.tv_sec = 86400; setitimer(ITIMER_VIRTUAL, &first, NULL); } init_etime(); secs = get_etime(); P(); secs = get_etime() - secs; printf(“%lf secs\n”, secs); double get_etime() { struct itimerval curr; getitimer(ITIMER_VIRTUAL,&curr); return(double)( (first.it_value.tv_sec - curr.it_value.tv_sec) + (first.it_value.tv_usec - curr.it_value.tv_usec)*1e-6); Using the interval timer

CS 213 S’00– 5 – class29.ppt Cycle Counters Most modern systems have built in registers that are incremented every clock cycle –Very fine grained –Maintained as part of process state »Save & restore with context switches »Counter will reflect time spent by user process Special assembly code instruction to access On (recent model) Intel machines: –64 bit counter. –RDTSC instruction sets %edx to high order 32-bits, %eax to low order 32-bits Wrap Around Times for 550 MHz machine Low order 32-bits wrap around every 2 32 / (550 * 10 6 ) = 7.8 seconds High order 64-bits wrap around every 2 64 / (550 * 10 6 ) = seconds – years

CS 213 S’00– 6 – class29.ppt Using the Cycle Counter Example –Function that returns number of cycles elapsed since previous call to function –Express as double to avoid overflow problems /* Keep track of most recent reading of cycle counter */ static unsigned cyc_hi = 0; static unsigned cyc_lo = 0; static double delta_cycles() { unsigned ncyc_hi, ncyc_lo; double result; /* Get cycle counter as ncyc_hi and ncyc_lo */... /* Do double precision subtraction */... cyc_hi = ncyc_hi; cyc_lo = ncyc_lo; return result; }

CS 213 S’00– 7 – class29.ppt Accessing the Cycle Counter (cont.) GCC allows inline assembly code with mechanism for matching registers with program variables Code only works on x86 machine compiling with GCC Emit assembly with rdtsc and two movl instructions Code generates two outputs: –Symbolic register %0 should be used for ncyc_hi –Symbolic register %1 should be used for ncyc_lo Code has no inputs Registers %eax and %edx will be overwritten unsigned ncyc_hi, ncyc_lo; /* Get cycle counter */ asm("rdtsc\nmovl %edx,%0\nmovl %eax,%1" : "=r" (ncyc_hi), "=r" (ncyc_lo) : /* No input */ : "%edx", "%eax");

CS 213 S’00– 8 – class29.ppt Accessing the Cycle Counter (cont.) Emitted Assembly Code delta_cycles: pushl %ebp# Stack stuff movl %esp,%ebp pushl %esi pushl %ebx #APP rdtsc# Result of ASM Statement movl %edx,%esi# Uses %esi for ncyc_hi movl %eax,%ecx# Uses %ecx for ncyc_lo #NO_APP movl %ecx,%ebx# ncyc_lo subl cyc_lo,%ebx cmpl %ecx,%ebx seta %al xorl %edx,%edx movb %al,%dl movl %esi,%eax# ncyc_hi

CS 213 S’00– 9 – class29.ppt Using the Cycle Counter (cont.) /* Keep track of most recent reading of cycle counter */ static unsigned cyc_hi = 0; static unsigned cyc_lo = 0; static double delta_cycles() { unsigned ncyc_hi, ncyc_lo; unsigned hi, lo, borrow; double result;... /* Do double precision subtraction */ lo = ncyc_lo - cyc_lo; borrow = lo > ncyc_lo; hi = ncyc_hi - cyc_hi - borrow; result = (double) hi * (1 << 30) * 4 + lo;... }

CS 213 S’00– 10 – class29.ppt Timing with Cycle Counter double tsecs; delta_cycles(); P(); tsecs = delta_cycles() / (MHZ * 1e6);

CS 213 S’00– 11 – class29.ppt Measurement Pitfalls Overhead Calling delta_cycles() incurs small amount of overhead Want to measure long enough code sequence to compensate Unexpected Cache Effects artificial hits or misses e.g., these measurements were taken with the Alpha cycle counter: foo1(array1, array2, array3); /* 68,829 cycles */ foo2(array1, array2, array3); /* 23,337 cycles */ vs. foo2(array1, array2, array3); /* 70,513 cycles */ foo1(array1, array2, array3); /* 23,203 cycles */

CS 213 S’00– 12 – class29.ppt Dealing with Overhead & Cache Effects Keep doubling number of times execute P() until reach some threshold –Used CMIN = int cnt = 1; double cmeas = 0; double cycles; do { int c = cnt; P();/* Warm up cache */ (void) delta_cycles(); while (c-- > 0) P(); cmeas = delta_cycles(); cycles = cmeas / cnt; cnt += cnt; } while (cmeas < CMIN); /* Make sure have enough */ return cycles / (1e6 * MHZ);

CS 213 S’00– 13 – class29.ppt Context Switching Context switches can also affect cache performance e.g., ( foo1, foo2 ) cycles on an unloaded timing server: »71,002, 23,617 »67,968, 23,384 »68,840, 23,365 »68,571, 23,492 »69,911, 23,692 Why Do Context Switches Matter? Cycle counter only accumulates when running user process Some amount of overhead Caches polluted by OS and other user’s code & data –Cold misses as restart process Measurement Strategy Try to measure uninterrupted code execution

CS 213 S’00– 14 – class29.ppt Detecting Context Switches Clock Interrupts Processor clock causes interrupt every  t seconds –Typically  t = 10 ms –Same as interval timer resolution Can detect by seeing if interval timer has advanced during measurement time tt start = get_etime(); /* Perform Measurement */... if (get_etime() - start > 0) /* Discard measurement */ Measurement takes place without interval timer advancing

CS 213 S’00– 15 – class29.ppt Detecting Context Switches (Cont.) External Interrupts E.g., due to completion of disk operation Occur at unpredictable times but generally take a long time to service Detecting See if real time clock has advanced –Using coarse-grained interval timer Reliability Good, but not 100% Can’t get clean measurements on heavily loaded system start = get_rtime(); /* Perform Measurement */... if (get_rtime() - start > 0) /* Discard measurement */

CS 213 S’00– 16 – class29.ppt Improving Accuracy Current Timer Code Assume that bad measurements always overestimate time –True if main problem is due to context switches Take multiple samples (2–10) until lowest two are within some small tolerance of each other Better Timing Code Erroneous measurements both under- and over-estimate time, but are not correlated to each other Look for clustering of times among samples

CS 213 S’00– 17 – class29.ppt Measurement Summary It’s difficult to get accurate times compensating for overhead but can’t always measure short procedures in loops –global state –mallocs –changes cache behavior It’s difficult to get repeatable times cache effects due to ordering and context switches Moral of the story: Adopt a healthy skepticism about measurements! Always subject measurements to sanity checks.

CS 213 S’00– 18 – class29.ppt Amdahl’s Law You plan to visit a friend in Normandy France and must decide whether it is worth it to take the Concorde SST ($3,100) or a 747 ($1,021) from NY to Paris, assuming it will take 4 hours Pgh to NY and 4 hours Paris to Normandy. time NY  Paristotal trip timespeedup over hours16.5 hours1 SST3.75 hours11.75 hours1.4 Taking the SST (which is 2.2 times faster) speeds up the overall trip by only a factor of 1.4!

CS 213 S’00– 19 – class29.ppt Speedup T1T1 T2T2 Old program (unenhanced) T 1 = time that can NOT be enhanced. T 2 = time that can be enhanced. T 2 = time after the enhancement. Old time: T = T 1 + T 2 T 1 = T 1 T 2  T 2 New program (enhanced) New time: T = T 1 + T 2 Speedup: S overall = T / T

CS 213 S’00– 20 – class29.ppt Computing Speedup Two key parameters: F enhanced = T 2 / T (fraction of original time that can be improved) S enhanced = T 2 / T 2 (speedup of enhanced part) T = T 1 + T 2 = T 1 + T 2 = T(1-F enhanced ) + T 2 = T(1 – F enhanced ) + (T 2 /S enhanced ) [by def of S enhanced ] = T(1 – F enhanced ) + T(F enhanced /S enhanced ) [by def of F enhanced ] = T((1 – F enhanced ) + F enhanced /S enhanced ) Amdahl’s Law: S overall = T / T = 1/((1 – F enhanced ) + F enhanced /S enhanced ) Key idea: Amdahl’s Law quantifies the general notion of diminishing returns. It applies to any activity, not just computer programs.

CS 213 S’00– 21 – class29.ppt Amdahl’s Law Example Trip example: Suppose that for the New York to Paris leg, we now consider the possibility of taking a rocket ship (15 minutes) or a handy rip in the fabric of space-time (0 minutes): time NY->Paristotal trip timespeedup over hours16.5 hours1 SST3.75 hours11.75 hours1.4 rocket0.25 hours8.25 hours2.0 rip0.0 hours8 hours2.1

CS 213 S’00– 22 – class29.ppt Seek Time = 6ms Disk surface spins at 3600*k RPM read/write head arm Magnetic Disk Example Average Rotational Latency 1/2 revolution takes 1 / (120 * k) seconds = 8.5/k milliseconds Total Latency: k = 1: 14.5 ms1.0X k = 4: 8.1 ms1.8X

CS 213 S’00– 23 – class29.ppt Lesson from Amdahl’s Law Useful Corollary of Amdahl’s law: 1  S overall  1 / (1 – F enhanced ) F enhanced Max S overall Moral: It is hard to speed up a program. Moral++ : It is easy to make premature optimizations.

CS 213 S’00– 24 – class29.ppt Other Maxims Second Corollary of Amdahl’s law: When you identify and eliminate one bottleneck in a system, something else will become the bottleneck Beware of Optimizing on Small Benchmarks Easy to cut corners that lead to asymptotic inefficiencies –E.g., Intel’s string hash function