Copyright 2014 – Noah Mendelsohn Performance and Code Tuning Noah Mendelsohn Tufts University Web:

Slides:



Advertisements
Similar presentations
Numbers Treasure Hunt Following each question, click on the answer. If correct, the next page will load with a graphic first – these can be used to check.
Advertisements

Simplifications of Context-Free Grammars
Variations of the Turing Machine
3rd Annual Plex/2E Worldwide Users Conference 13A Batch Processing in 2E Jeffrey A. Welsh, STAR BASE Consulting, Inc. September 20, 2007.
AP STUDY SESSION 2.
1
Chapter 7 Constructors and Other Tools. Copyright © 2006 Pearson Addison-Wesley. All rights reserved. 7-2 Learning Objectives Constructors Definitions.
Copyright © 2003 Pearson Education, Inc. Slide 1 Computer Systems Organization & Architecture Chapters 8-12 John D. Carpinelli.
1 Copyright © 2013 Elsevier Inc. All rights reserved. Chapter 4 Computing Platforms.
Processes and Operating Systems
Copyright © 2011, Elsevier Inc. All rights reserved. Chapter 6 Author: Julia Richards and R. Scott Hawley.
Introduction to Algorithms 6.046J/18.401J
Custom Statutory Programs Chapter 3. Customary Statutory Programs and Titles 3-2 Objectives Add Local Statutory Programs Create Customer Application For.
Programming Language Concepts
1 Advanced Tools for Account Searches and Portfolios Dawn Gamache Cindy Bylander.
1 Click here to End Presentation Software: Installation and Updates Internet Download CD release NACIS Updates.
Office 2003 Introductory Concepts and Techniques M i c r o s o f t Windows XP Project An Introduction to Microsoft Windows XP and Office 2003.
Photo Slideshow Instructions (delete before presenting or this page will show when slideshow loops) 1.Set PowerPoint to work in Outline. View/Normal click.
1.
Break Time Remaining 10:00.
Factoring Quadratics — ax² + bx + c Topic
Turing Machines.
Table 12.1: Cash Flows to a Cash and Carry Trading Strategy.
Red Tag Date 13/12/11 5S.
Database Performance Tuning and Query Optimization
PP Test Review Sections 6-1 to 6-6
User Friendly Price Book Maintenance A Family of Enhancements For iSeries 400 DMAS from Copyright I/O International, 2006, 2007, 2008, 2010 Skip Intro.
1 The Blue Café by Chris Rea My world is miles of endless roads.
Bright Futures Guidelines Priorities and Screening Tables
EIS Bridge Tool and Staging Tables September 1, 2009 Instructor: Way Poteat Slide: 1.
Chapter 10: Virtual Memory
CS 6143 COMPUTER ARCHITECTURE II SPRING 2014 ACM Principles and Practice of Parallel Programming, PPoPP, 2006 Panel Presentations Parallel Processing is.
Operating Systems Operating Systems - Winter 2010 Chapter 3 – Input/Output Vrije Universiteit Amsterdam.
Exarte Bezoek aan de Mediacampus Bachelor in de grafische en digitale media April 2014.
Sample Service Screenshots Enterprise Cloud Service 11.3.
Copyright © 2012, Elsevier Inc. All rights Reserved. 1 Chapter 7 Modeling Structure with Blocks.
1 RA III - Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Buenos Aires, Argentina, 25 – 27 October 2006 Status of observing programmes in RA.
Basel-ICU-Journal Challenge18/20/ Basel-ICU-Journal Challenge8/20/2014.
1..
CONTROL VISION Set-up. Step 1 Step 2 Step 3 Step 5 Step 4.
Adding Up In Chunks.
While Loop Lesson CS1313 Spring while Loop Outline 1.while Loop Outline 2.while Loop Example #1 3.while Loop Example #2 4.while Loop Example #3.
1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt Synthetic.
: 3 00.
5 minutes.
1 hi at no doifpi me be go we of at be do go hi if me no of pi we Inorder Traversal Inorder traversal. n Visit the left subtree. n Visit the node. n Visit.
1 Let’s Recapitulate. 2 Regular Languages DFAs NFAs Regular Expressions Regular Grammars.
Speak Up for Safety Dr. Susan Strauss Harassment & Bullying Consultant November 9, 2012.
Pointers and Arrays Chapter 12
Essential Cell Biology
Converting a Fraction to %
Clock will move after 1 minute
PSSA Preparation.
Chapter 11 Creating Framed Layouts Principles of Web Design, 4 th Edition.
Essential Cell Biology
Copyright 2013 – Noah Mendelsohn Compiling C Programs Noah Mendelsohn Tufts University Web:
Chapter 13 Web Page Design Studio
Physics for Scientists & Engineers, 3rd Edition
Energy Generation in Mitochondria and Chlorplasts
Select a time to count down from the clock above
Manipulating Bit Fields in C Noah Mendelsohn Tufts University Web: COMP 40: Machine.
Copyright Tim Morris/St Stephen's School
Profile. 1.Open an Internet web browser and type into the web browser address bar. 2.You will see a web page similar to the one on.
User Defined Functions Lesson 1 CS1313 Fall User Defined Functions 1 Outline 1.User Defined Functions 1 Outline 2.Standard Library Not Enough #1.
1 Decidability continued…. 2 Theorem: For a recursively enumerable language it is undecidable to determine whether is finite Proof: We will reduce the.
Abstraction, Modularity, Interfaces and Pointers Original slides by Noah Mendelsohn, including content from Mark Sheldon, Noah Daniels, Norman Ramsey COMP.
Copyright 2014 – Noah Mendelsohn Introduction to Software Performance & Tuning Noah Mendelsohn Tufts University
Copyright 2014 – Noah Mendelsohn Code Tuning Noah Mendelsohn Tufts University Web:
Copyright 2014 – Noah Mendelsohn Performance Analysis Tools Noah Mendelsohn Tufts University Web:
Presentation transcript:

Copyright 2014 – Noah Mendelsohn Performance and Code Tuning Noah Mendelsohn Tufts University Web: COMP 40: Machine Structure and Assembly Language Programming (Spring 2014)

© 2010 Noah Mendelsohn 2 Today When & how to tune Factors that affect performance Barriers to performance Code Tuning

© 2010 Noah Mendelsohn 3 When to Optimize

© 2010 Noah Mendelsohn When to optimize Know your performance goals –Dont create a complex implementation if simple is fast enough –E.g. compute display frames 30/sec –Question: are you sharing the machine? If not, use it all! –Question: why do programmers use slow high level languages (Ruby, Haskell, Spreadsheets)? Introduce complexity only where necessary for speed Dont guess, measure! Goal: understand where the time is going and why before changing anything! 4

© 2010 Noah Mendelsohn Choose good algorithms Matters if processing lots of data Try to use existing implementations of good algorithms Otherwise, only use complex algorithms if data size merits doing so Leave tracks in comments or elsewhere if your algorithms wont scale 5

© 2010 Noah Mendelsohn 6 Measuring Performance

© 2010 Noah Mendelsohn What are we measuring Time –Elapsed time wall clock –CPU time from your code (user time) –CPU time from system work on your behalf –Waiting time: suggests I/O problems u 0.077s 0: % 0+0k 0+0io 0pf+0w $ time quick_sort < rand

© 2010 Noah Mendelsohn What are we measuring Time –Elapsed time wall clock –CPU time from your code (user time) –CPU time from system work on your behalf –Waiting time: suggests I/O problems 8 $ time quick_sort < rand u 0.077s 0: % 0+0k 0+0io 0pf+0w

© 2010 Noah Mendelsohn What are we measuring Time –Elapsed time wall clock –CPU time from your code (user time) –CPU time from system work on your behalf –Waiting time: suggests I/O problems 9 $ time quick_sort < rand u 0.077s 0: % 0+0k 0+0io 0pf+0w

© 2010 Noah Mendelsohn What are we measuring Time –Elapsed time wall clock –CPU time from your code (user time) –CPU time from system work on your behalf –Waiting time: suggests I/O problems 10 $ time quick_sort < rand u 0.077s 0: % 0+0k 0+0io 0pf+0w

© 2010 Noah Mendelsohn What are we measuring Time –Elapsed time wall clock –CPU time from your code (user time) –CPU time from system work on your behalf –Waiting time: suggests I/O problems 11 $ time quick_sort < rand u 0.077s 0: % 0+0k 0+0io 0pf+0w This program took 1.25 seconds to complete. The CPU was working on the program 94.4 % of that time seconds of CPU time was spent in the users own code were spend in the operating system doing work for the user (probably code to support file input/output)

© 2010 Noah Mendelsohn What are we measuring Time –Elapsed time wall clock –CPU time from your code (user time) –CPU time from system work on your behalf –Waiting time: suggests I/O problems 12

© 2010 Noah Mendelsohn What are we measuring Time –Elapsed time wall clock –CPU time from your code (user time) –CPU time from system work on your behalf –Waiting time: suggests I/O problems 13 Your program can also include timing checks in the code to measure specific sections. For more information, explore: man 2 times man 3 clock man 2 getrusage

© 2010 Noah Mendelsohn What are we measuring Time –Elapsed time wall clock –CPU time from your code –CPU time from system work on your behalf –Waiting time: suggests I/O problems I/O (amount read/written and rate at which I/O is done) Paging rates (OS writes data to disk when memory is over-committed) Contention from other users 14

© 2010 Noah Mendelsohn 15 Performance Modeling

© 2010 Noah Mendelsohn What is a performance model? A model you develop to predict where time will be spent by your program To build a model, you need: –A clear understanding of how computer hardware performs –A clear understanding (or guess) as to how compilers will translate your programs –A high level picture of the flow of your program and where it will spend time Your model will also depend on the size and nature of the data your program processes When you care about performance, build the best mental model you can, then measure to learn and update your model 16 The first few times you do this youll find that your predictive powers arent good…keep at it, and youll develop a good intuitive sense of why your programs perform as they do

© 2010 Noah Mendelsohn What is a performance model? A model you develop to predict where time will be spent by your program To build a model, you need: –A clear understanding of how computer hardware performs –A clear understanding (or guess) as to how compilers will translate your programs –A high level picture of the flow of your program and where it will spend time Your model will also depend on the size and nature of the data your program will process When you care about performance, build the best mental model you can, then measure to learn and update your model 17 The first few times you do this youll find that your predictive powers arent good…keep at it, and youll develop a good intuitive sense of why your programs perform as they do Do not let your growing skills at predicting performance tempt you into adding unnecessary complexity into your programs… …even programmers who are skilled at modeling and prediction can make the mistake of optimizing prematurely

© 2010 Noah Mendelsohn 18 Don't do premature optimization. Write the program well. Choose a good algorithm. Then measure performance, then improve the parts that most affect performance.

© 2010 Noah Mendelsohn The bank robbery model of code tuning: Hot spots 19 Remember: 80% of the time is usually spent in 20% of the code. Tune that 20% only! Q: Mr. Sutton, why do you rob banks? A: That's where the money is.

© 2010 Noah Mendelsohn Aside: more on Willie Sutton 20 The irony of using a bank robber's maxim […] is compounded, I will now confess, by the fact that I never said it. The credit belongs to some enterprising reporter who apparently felt a need to fill out his copy... If anybody had asked me, I'd have probably said it. That's what almost anybody would say...it couldn't be more obvious. In his partly ghostwritten autobiography, Where the Money Was: The Memoirs of a Bank Robber (Viking Press, New York, 1976), Sutton dismissed this story, saying: Source:

© 2010 Noah Mendelsohn Aside: more on Willie Sutton 21 The irony of using a bank robber's maxim […] is compounded, I will now confess, by the fact that I never said it. The credit belongs to some enterprising reporter who apparently felt a need to fill out his copy... If anybody had asked me, I'd have probably said it. That's what almost anybody would say...it couldn't be more obvious. Or could it? Why did I rob banks? Because I enjoyed it. I loved it. I was more alive when I was inside a bank, robbing it, than at any other time in my life. I enjoyed everything about it so much that one or two weeks later I'd be out looking for the next job. But to me the money was the chips, that's all. Go where the money is...and go there often. In his partly ghostwritten autobiography, Where the Money Was: The Memoirs of a Bank Robber (Viking Press, New York, 1976), Sutton dismissed this story, saying: Source:

© 2010 Noah Mendelsohn 22 Measurement and Analysis Techniques

© 2010 Noah Mendelsohn Tools for understanding what your code is doing Statistics from the OS: –Overview of resources your program uses –Examples: /usr/bin/time command; top ; etc. many others –Also look for tools that can report on I/O or network activity if pertinent 23

© 2010 Noah Mendelsohn Tools for understanding what your code is doing Statistics from the OS: –Overview of resources your program uses –Examples: /usr/bin/time command; top ; etc. many others –Also look for tools that can report on I/O or network activity if pertinent Profilers: –Typically bill CPU time (or other resources) to specific functions or lines of code –Examples: gprof (older, but simple & fast to use); kcachegrind 24

© 2010 Noah Mendelsohn kcachgrind analysis of a quick sort 25

© 2010 Noah Mendelsohn Tools for understanding what your code is doing Statistics from the OS: –Overview of resources your program uses –Examples: /usr/bin/time command; top ; etc. many others –Also look for tools that can report on I/O or network activity if pertinent Profilers: –Typically bill CPU time (or other resources) to specific functions or lines of code –Examples: gprof (older, but simple & fast to use); kcachegrind Instrument your code –Add counters, etc. for code or I/O events that concern you –Use system calls to do microsecond timing of important sections of code –Print output at end, or (more sophisticated) implement real time monitoring UI –Make sure you understand whether your instrumentation is affecting performance of your code! 26

© 2010 Noah Mendelsohn Tools for understanding what your code is doing Statistics from the OS: –Overview of resources your program uses –Examples: /usr/bin/time command; top ; etc. many others –Also look for tools that can report on I/O or network activity if pertinent Profilers: –Typically bill CPU time (or other resources) to specific functions or lines of code –Examples: gprof (older, but simple & fast to use); kcachegrind Instrument your code –Add counters, etc. for code or I/O events that concern you –Use system calls to do microsecond timing of important sections of code –Print output at end, or (more sophisticated) implement real time monitoring UI –Make sure you understand whether your instrumentation is affecting performance of your code! Read the generated assembler code 27

© 2010 Noah Mendelsohn Aside: why use C if were reading the Assembler? 1.You probably wont be reading most of the assembler, just critical parts 2.Writing C and reading assembler is usually easier than writing asm 3.If you write in C and move to another machine architecture, your tuning modifications may no longer work well, but at least the program will run 28

© 2010 Noah Mendelsohn 29 Barriers to Performance

© 2010 Noah Mendelsohn Performance killers Too much memory traffic –Remember: memory may be 100x slower than registers –Performance model: learn to guess what the compiler will keep in registers –Watch locality: cache hits are still much better than misses! –Note that large structures usually result in lots of memory traffic & poor cache performance Malloc / free –They are slow, and memory leaks can destroy locality Excessively large data structures – bad locality Too many levels of indirection (pointer chasing) –Can result from efforts to create clean abstractions! 30

© 2010 Noah Mendelsohn Example of unnecessary indirection Compare: struct Array2_T { int width, height; char *elements; }; … a->elements[i] /* same as *((*a).elements + i) */ With struct Array2_T { int width, height; char elements[]; }; … a->elements[i] /* *((a + offset(elements)) + i) */ 31 Struct Array2_T elements a Struct Array2_T elements a

© 2010 Noah Mendelsohn Example of unnecessary indirection Compare: struct Array2_T { int width, height; char *elements; }; … a->elements[i] /* same as *((*a).elements + i) */ With struct Array2_T { int width, height; char elements[]; }; … a->elements[i] /* *((a + offset(elements)) + i) */ 32 Struct Array2_T elements a Struct Array2_T elements a Tricky C construct: If you malloc extra space at the end of the struct, C will let you address those as array elements!

© 2010 Noah Mendelsohn Performance killers Too much memory traffic –Remember: memory may be 100x slower than registers –Performance model: learn to guess what the compiler will keep in registers –Watch locality: cache hits are still much better than misses! –Note that large structures usually result in lots of memory traffic & poor cache performance Malloc / free –They are slow, and memory leaks can destroy locality Excessively large data structures – bad locality Too many levels of indirection (pointer chasing) –Can result from efforts to create clean abstractions! Too many function calls 33

© 2010 Noah Mendelsohn 34 Barriers to Compiler Optimization

© 2010 Noah Mendelsohn What inhibits compiler optimization? Unnecessary function calls –For small functions, call overhead may exceed time for useful work! –Compiler can eliminate some not all Unnecessary recomputation in loops (especially function calls) Bad: for(i=0; i < array.length(); i++) constant? Better: len = array.length(); for(i=0; i < len; i++) Duplicate code: Bad: a = (f(x) + g(x)) * (f(x) / g(x)) Better : ftmp = f(x); gtmp = g(x); a = (ftmp + gtmp) * (ftmp / gtmp) Aliasing (next slide) 35

© 2010 Noah Mendelsohn Aliasing: two names for the same thing 36 int my_function(int *a, int *b) { *b = (*a) * 2; *a = (*a) * 2; return *a + *b; } int c = 3; int d = 4; result = my_function(&c, &d); Result is: 14 int my_function(int a, int b) { b = a * 2; a = a * 2; return a + b; } int c = 3; int d = 4; result = my_function(c, d); Result is: 14 (c and d are updated) Compare these functions, which are quite similar

© 2010 Noah Mendelsohn Aliasing: two names for the same thing 37 int my_function(int *a, int *b) { *b = (*a) * 2; *a = (*a) * 2; return *a + *b; } int c = 3; result = my_function(&c, &c); Correct result is: 24 (and c is left with value 12!)

© 2010 Noah Mendelsohn Aliasing: two names for the same thing 38 int my_function(int *a, int *b) { *b = (*a) * 2; *a = (*a) * 2; return *a + *b; } int c = 3; result = my_function(&c, &c); Just in case arguments are the same… …compiler can never eliminate the duplicate computation! Correct result is: 24 (and c is left with value 12!)

© 2010 Noah Mendelsohn Aliasing 39 int my_function(int *a, int *b) { *b = (*a) * 2; *a = (*a) * 2; return *a + *b; } int c = 3; result = my_function(&c, &c); Even worse, if my_function is in a different source file, compiler has to assume the function may have kept a copy of &c…now it has to assume most any external function call could c (or most anything else!)

© 2010 Noah Mendelsohn 40 Optimization Techniques

© 2010 Noah Mendelsohn Code optimization techniques Moving code out of loops Common subexpression elimination When in doubt, let the compiler do its job –Higher optimization levels (e.g. gcc –O2) Inlining and macro expansion –Eliminate overhead for small functions 41

© 2010 Noah Mendelsohn Macro expansion to save function calls 42 int max(int a, int b) { return (a > b) ? a : b; } int x = max(c,d);

© 2010 Noah Mendelsohn Macro expansion to save function calls 43 int max(int a, int b) { return (a > b) ? a : b; } int x = max(c,d); #define max(a,b) ((a) > (b) ? (a) : (b)) Expands directly to: int x = ((c) > (d) ? (c) : (d)); Whats the problem with: int x = max(c++, d); ??

© 2010 Noah Mendelsohn Inlining: best of both worlds 44 int max(int a, int b) { return (a > b) ? a : b; } int x = max(c,d); #define max(a,b) ((a) > (b) ? (a) : (b)) Compiler expands directly to: int x = ((c) > (d) ? (c) : (d)); Now theres no problem with: int x = max(c++, d);

© 2010 Noah Mendelsohn Inlining: best of both worlds 45 int max(int a, int b) { return (a > b) ? a : b; } int x = max(c,d); Compiler expands directly to: int x = ((c) > (d) ? (c) : (d)); inline int max(int a, int b) { return (a > b) ? a : b; } Most compilers do a good job anyway, but you can suggest which functions to inline. Still, we mostly discourage explicit inline specifications these days, as compilers do a very good job without them!

© 2010 Noah Mendelsohn Code optimization techniques Moving code out of loops Common subexpression elimination When in doubt, let the compiler do its job –Higher optimization levels (e.g. gcc –O2) Inlining and macro expansion –Eliminate overhead for small functions Specialization 46

© 2010 Noah Mendelsohn Specialization 47 static int sum_squares(int x, int y) { return ((x * x) + (y * y)); } … int num1 = 7; int num2 = 3; sum = sum_squares(6, 3); (6*6) + (3*3) = 45 gcc –O2 generates one instruction!: movl$45, %r9d

© 2010 Noah Mendelsohn Specialization with header files – must use static 48 static int sum_squares(int x, int y) { return ((x * x) + (y * y)); } sum_squares.h #include sum_squares.h sum = sum_squares(6, 3); src1.c #include sum_squares.h sum = sum_squares(6, 5); src2.c gcc –p propgram src1.o src2.o Without static declaration, sum_square is declared in duplicate in src1.o and src2.o

© 2010 Noah Mendelsohn Code optimization techniques Moving code out of loops Common subexpression elimination When in doubt, let the compiler do its job –Higher optimization levels (e.g. gcc –O2) Inlining and macro expansion –Eliminate overhead for small functions Specialization Optimizations often compound: inlining and specialization Dont worry about asserts unless tests take lots of time Conditional compilation of debug code: #ifdef DEBUG can use gcc –DDEBUG to define the flag 49