October 20-24. Primitive Performance Roger Hui, Morten Kromberg Dyalog LTD Dyalog13.



Primitive Performance
Goal: Constantly improve the performance of the existing primitive functions and operators
Two main problems...
Hard: Deciding what to optimise
– Easy: Clever people must think of better algorithms
Hard: Don't accidentally cause slowdowns
– Hard: Even understanding whether it happened

Prioritizing Tuning
Deciding where to start:
APLMON: Profiles the APL interpreter
]PROFILE: Profiles application code
Customer benchmarks
– Please send us your code!
Comparisons with other array languages
– Internal testing
– External benchmarks
– Conversion projects to Dyalog APL

Don't slow anything down!!!
Over time, there is a tendency for things to slow down as features are added
– Unicode, 64-bit, OO, better error messages, etc...
– Sometimes even as a side-effect of tuning work
Solution: The Performance Quality Assurance (PQA) Framework: an internal tool for the Dyalog Development team to measure the performance of individual primitives and the execution framework on a daily basis

PQA Project Goals
Reliably detect slowdowns greater than 2%, in any primitive function or operator expression
Publish a performance certificate for each release
– No surprises for customers: Slowdowns that we cannot compensate for should be explained (e.g. 64-bit project)
– Hard evidence of speed-ups for the world to see
Run PQA continuously during development, catch performance degradation immediately!
– Avoid the expensive search for the bad code change sometime last year
– Important: Avoid false positives (they are VERY expensive)
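The 2% detection goal can be illustrated with a minimal sketch. It is written in Python purely for illustration, not in the APL of the PQA framework. The function names, the sample timings and the choice of the median as the summary statistic are all assumptions; the slides do not say which statistic PQA compares.

```python
from statistics import median

def relative_change(baseline, candidate):
    """Fractional change of the candidate median against the baseline median."""
    base = median(baseline)
    return (median(candidate) - base) / base

def is_slowdown(baseline, candidate, threshold=0.02):
    """True when the candidate is slower than the baseline by more than threshold."""
    return relative_change(baseline, candidate) > threshold

# Hypothetical timings (microseconds) of one expression on two builds.
old_build = [100.0, 101.0, 99.0, 100.5, 100.0]
new_build = [103.0, 104.0, 102.5, 103.5, 103.0]

print(round(relative_change(old_build, new_build), 3))  # 0.03
print(is_slowdown(old_build, new_build))                # True
```

A single summary statistic like this is only a starting point: with timings as small as the ones discussed on the following slides, measurement noise can easily exceed 2%.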

Challenges
A huge number of different cases to generate and test
Getting repeatable timings is extraordinarily difficult
– Some timings are TINY e.g. 0+0
Huge volume of data to analyse

Huge Number of Cases
APL is our friend
PQA framework generates ~14,000 different APL expressions
~600 different variables are created for use by different expressions
Each expression is repeated for approximately 3-4 seconds
– Currently split into 10 runs of <0.5 secs each
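How a small set of ingredients multiplies into many test expressions can be sketched as follows. This is a Python illustration, not the APL generator used by PQA. The chosen subset of functions and datatypes and the rule that both arguments share a length are assumptions made to keep the example small; only the name pattern (for example xs4+ys4) is taken from the slides.

```python
from itertools import product

DATATYPES = "bsild"                       # boolean, short, integer, long, double
LENGTHS = "0124"                          # 1e0, 1e1, 1e2 and 1e4 elements
DYADIC_FUNCTIONS = ["+", "×", "|", "<", ">"]

def dyadic_expressions(functions=DYADIC_FUNCTIONS, datatypes=DATATYPES, lengths=LENGTHS):
    """Yield 'x<type><len><fn>y<type><len>' with both arguments the same length."""
    for fn, left, right, length in product(functions, datatypes, datatypes, lengths):
        yield f"x{left}{length}{fn}y{right}{length}"

expressions = list(dyadic_expressions())
print(len(expressions))                   # 5 functions × 5 × 5 types × 4 lengths = 500
print("xs4+ys4" in expressions)           # True
```

Five functions over five datatypes already give 500 expressions, which suggests how the full set of primitives, operators, array kinds and domains could reach the order of 14,000.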

100 Expressions (selected at random)
+z2 ×i2 |l2 s4 ÷zn0 xs4+ys4 xi1×yb1 xl2|yi2 xi4 yl4 xi1<ys1 xi1ys1 xl0>yb0 xi0yl0 ¯11yd1 xa4 ya2 xb2 yi1 xb2 ys2 xs0 yl2 xs0ys0
xs1~ys0 xs4 yb4 xi2 yl0 xd0~yd1 xz1 yz1 bw4 +\iw2 \dw2 \bt1 lt4 -/bq2 \bq1 sq4 xb2.+yb1 10[0] bw1 10[0] zt0 ¯10[0] bt1,dv1 av2 lv4 iv1 at1,xq2,zq1 aq4 s xv4 s dv2 ¯10lv0 bv1,sv1 sv4,dv4 av4 lv4 xv0 av0 xv0 dv4 iv4zv0 lv0zv4 dv0xv0 dv0 iv0 dv4 bv4 s zv4 ¯1aw0 ¯1lw2 10lw2 11 ¯10sw2 ¯1 xw4 ¯10 lw1 sw2,iw1 zw2,lw2 xw0 dw4 xw4 sw4 bw4 lw0 iw4sw0 iw4 dw0 iw4 dw4 zw0 sw4 zw4xw4 zw4 sw4 s at2 10it2 ¯10dt4 11 ¯1at0 11 ¯1zt2 10 bt2 ¯10 bt4 zt0 it2 zt2 lt2 st0,lt0 at4 dt0 xt4 it0 st0it0 lt4zt4 11 1bq0 11 ¯10sq2 (j0 k) xv4 (k k4) zw4 bv2[j0]bv0 dv2[j0]dv0 paren10 { [ ]}iv4
(... and about 13,900 others)

Variables (~600)

left (x) or right (y) argument

datatype
  b - boolean, 1 bit
  s - short integer, 1 byte
  i - integer, 2 bytes
  l - long integer, 4 bytes
  d - double, 8 bytes
  z - complex, 2 doubles, 16 bytes
  f - DECF
  a - alphanumeric, 1-byte character
  x - enclosed (char vectors)

length, usually number of elements, but can be number of rows or number of columns
  0 - 1e0, vector or scalar or singleton
  1 - 1e1
  2 - 1e2
  4 - 1e4

kind of array
  v - vector (but *v0 is a scalar)
  t - tall matrix, 11-column matrix with 10* rows
  w - wide matrix, 11-row matrix with 10* columns
  q - square matrix with 10* ×0.6 ( ) rows/columns

domain (for variables used in scalar functions)
  n - non-zero
  p - positive and ~ 0 1
  u - unit circle; used in inverse trig functions

special
  c k - scalar indices ?6
  i - int vector of file com nos or native file indices
  j0 j1 j2 - index vectors of length 1e0 1e1 1e2
  k1 k2 k4 - index vectors of length 7 in the range 1e1 1e2 1e4
  bvc svc ivc lvc element vectors of various types

Examples:
  zn0: complex non-zero scalar
  l4: 10,000 element long integer
  sp1: 10 element 1-byte ints >1
  d2: 100 element double
  bt2: 100x11 boolean matrix
  xw4: 11x10000 matrix of enclosed char vectors
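The naming scheme is regular enough to decode mechanically. The sketch below, in Python for illustration only, assumes a name consists of an optional argument prefix, a datatype letter, an optional array kind, an optional domain and a length digit, in that order. That ordering is inferred from the examples on the slide rather than stated by it, and the function and table names are hypothetical. The "special" names and the row count of square (q) matrices are not handled.

```python
import re

ARGUMENTS = {"x": "left", "y": "right"}
DATATYPES = {
    "b": "boolean", "s": "short integer", "i": "integer", "l": "long integer",
    "d": "double", "z": "complex", "f": "DECF", "a": "alphanumeric", "x": "enclosed",
}
KINDS = {"v": "vector", "t": "tall matrix", "w": "wide matrix", "q": "square matrix"}
DOMAINS = {"n": "non-zero", "p": "positive", "u": "unit circle"}

# Assumed layout: [x|y] datatype [kind] [domain] length
NAME = re.compile(r"([xy])?([bsildzfax])([vtwq])?([npu])?([0124])")

def decode(name):
    """Split a PQA-style variable name into its descriptive parts."""
    match = NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised variable name: {name!r}")
    argument, datatype, kind, domain, length = match.groups()
    count = 10 ** int(length)
    if kind == "t":
        shape = (count, 11)      # tall: 11 columns
    elif kind == "w":
        shape = (11, count)      # wide: 11 rows
    else:
        shape = None             # vectors, scalars and square matrices
    return {
        "argument": ARGUMENTS.get(argument),
        "datatype": DATATYPES[datatype],
        "kind": KINDS.get(kind),
        "domain": DOMAINS.get(domain),
        "length": count,
        "shape": shape,
    }

print(decode("bt2")["shape"])    # (100, 11)
print(decode("xw4")["datatype"]) # enclosed
```

Because x doubles as the left-argument prefix and the enclosed datatype, the pattern relies on backtracking: xs4 parses as a left argument of short integers, while xw4 parses as a wide matrix of enclosed items.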

Repeatable Timings...
Use a dedicated machine (real, NOT virtual!)
– At Dyalog: 4 cores, 96 GB RAM, nothing installed except APL
Run processes at high or realtime priority
Pre-expand workspaces using 2000⌶
Control workspace compactions carefully
Carefully craft the execution loop to have minimum variable overhead

The Inner Loop (1/2)
r←a TIMEX b;ai;cnt;n;rep;min;e;sum;kt;m
 cnt←0                              ⍝ We will try 3 times
 :Repeat
     kt←GetPrivilegedProcessorTime  ⍝ Will be checked at end
     cnt←cnt+1
     ai←⎕AI                         ⍝ Record CPU & Elapsed time
     {}⎕WA                          ⍝ Compact workspace
     pqa_cal_wait                   ⍝ Check time of calibration expression
     pqa_redef b                    ⍝ ensure args are in new pockets
     :If 0reppqa_REPS[pqa_I] r minpqa_TIME[pqa_I;1] eb   ⍝ (use reps set in file to be compared with)
     :Else
         min1 /r10 timefx eb
         :If /mpqa_reps_EXPR.=(¯1 pqa_reps_EXPR)b
             reppqa_reps_REPS[m 1]
         :Else rep1 pqa_rep_ticks pqa_rep_ticks÷min      ⍝ reps required to get 200 ticks (70 microsec)
         :EndIf
     :EndIf
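The repetition-count step in the loop above (enough repetitions for a batch to reach a target duration) can be sketched in Python. This is an illustration under assumptions, not a transcription of TIMEX: the helper names are hypothetical, the clock is Python's `time.perf_counter_ns`, and the 70 microsecond target is borrowed from the comment in the listing.

```python
import math
import time

def reps_for_target(min_ticks, target_ticks=200):
    """Repetitions needed so that one timed batch lasts at least target_ticks."""
    return max(1, math.ceil(target_ticks / max(1, min_ticks)))

def fastest_single_run(fn, trials=10):
    """Smallest elapsed time in nanoseconds over several single calls (at least 1)."""
    best = None
    for _ in range(trials):
        start = time.perf_counter_ns()
        fn()
        elapsed = time.perf_counter_ns() - start
        best = elapsed if best is None else min(best, elapsed)
    return max(1, best)

def time_per_call(fn, reps):
    """Average nanoseconds per call, measured over one batch of reps calls."""
    start = time.perf_counter_ns()
    for _ in range(reps):
        fn()
    return (time.perf_counter_ns() - start) / reps

def tiny():
    return 0 + 0                      # stand-in for a very cheap expression

reps = reps_for_target(fastest_single_run(tiny), target_ticks=70_000)  # 70 microseconds
print(reps >= 1)                      # True
print(time_per_call(tiny, reps) >= 0) # True
```

Timing a batch rather than a single call matters most for expressions as cheap as 0+0, where one execution is shorter than the clock can resolve.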

Recorded Data
The complete distribution of [several thousand] timings for each expression is recorded
– The inner loop size for each expression is recorded and can be used as input to the next recording to create [more] comparable timings
Deciding what the data means is not easy...

Reporting
Providing useful reports on such a large quantity of data is a huge challenge.
Report needs to quickly identify bad (and good) news, without false positives.
– A report with many false positives is worse than useless
Current run-time for data collection is ~13 hours, which makes the tool hard to use during development
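One conceivable way to keep false positives out of such a report is to list a change only when the two timing distributions do not overlap at all. The Python sketch below illustrates that idea; it is not the method used by the PQA reports, which the slides do not describe. The function name, the overlap rule and the sample timings are all assumptions.

```python
from statistics import median

def summarize(results, threshold=0.02):
    """Return (change, expression) pairs, worst slowdown first.

    results maps an expression to (baseline timings, candidate timings).
    A change is reported only if it exceeds the threshold AND the two
    sets of timings are completely separated.
    """
    rows = []
    for expression, (baseline, candidate) in results.items():
        change = median(candidate) / median(baseline) - 1
        separated = min(candidate) > max(baseline) or max(candidate) < min(baseline)
        if abs(change) > threshold and separated:
            rows.append((change, expression))
    rows.sort(reverse=True)
    return rows

# Hypothetical timings for three expressions on two builds.
results = {
    "xs4+ys4": ([10.0, 10.1, 10.2], [10.6, 10.7, 10.8]),   # clear slowdown
    "xl2|yi2": ([5.0, 5.1, 5.2], [4.0, 4.1, 4.2]),         # clear speed-up
    "xi1×yb1": ([1.0, 1.3, 1.1], [1.2, 1.15, 1.05]),       # noisy, overlapping
}

report = summarize(results)
for change, expression in report:
    print(f"{expression:10s} {change:+.1%}")
```

The third expression shows a median shift of more than 4%, yet it is left out because its timings overlap. The price of such a strict rule is that genuine but small regressions in noisy expressions go unreported.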

The Hardest Part (for me)

A More Interesting Report

V14.0 With 3 Months To Go

Planned Work
Finalize report format and issue official 13.2 report – then 14.0
Hook reporting tool up to internal (MiServer-based) web server so the entire development team can drill down and schedule runs
Create shorter test and web-based scheduler for ad hoc use by developers needing short turnaround to verify a change
Bring APLMON categories in line with PQA, so an APLMON profile can be combined with PQA data to predict performance (might work)
Holy Grail: Hook PQA up to overnight build system, so updates are blocked if a fix causes a degradation (and responsible developer fined for not running the test himself)

Credits
Most of the real work done by Roger Hui
– (and most of the tuning, too!)
Morten is still working on getting reproducible/stable numbers and reporting