October 20-24. Primitive Performance Roger Hui, Morten Kromberg Dyalog LTD Dyalog13.


Primitive Performance
Goal: Constantly improve the performance of the existing primitive functions and operators
Two main problems...
Hard: Deciding what to optimise
  –Easy: Clever people must think of better algorithms
Hard: Don't accidentally cause slowdowns
  –Hard: Even understanding whether it happened

Prioritizing Tuning
Deciding where to start:
APLMON: Profiles the APL interpreter
]PROFILE: Profiles application code
Customer benchmarks
  –Please send us your code!
Comparisons with other array languages
  –Internal testing
  –External benchmarks
  –Conversion projects to Dyalog APL

Don't slow anything down!!!
Over time, there is a tendency for things to slow down as features are added
  –Unicode, 64-bit, OO, better error messages, etc...
  –Sometimes even as a side-effect of tuning work
Solution: The Performance Quality Assurance (PQA) Framework
  –Internal tool for the Dyalog development team to measure the performance of individual primitives and the execution framework on a daily basis

PQA Project Goals
Reliably detect slowdowns greater than 2% in any primitive function or operator expression
Publish a performance certificate for each release
  –No surprises for customers: Slowdowns that we cannot compensate for should be explained (e.g. 64-bit project)
  –Hard evidence of speed-ups for the world to see
Run PQA continuously during development, catch performance degradation immediately!
  –Avoid the expensive search for the bad code change sometime last year
  –Important: Avoid false positives (they are VERY expensive)
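The 2% goal can be illustrated with a small Python sketch. This is not the PQA implementation (which is written in APL); it assumes each expression has a list of timings from a baseline build and from a candidate build, and it compares the fastest timing in each. The choice of the minimum as the summary statistic, the function names and the sample numbers are all assumptions made for illustration.

```python
def relative_slowdown(baseline, candidate):
    """Fractional change of the best candidate timing vs. the best baseline timing.

    Positive means the candidate is slower. Using the minimum of each list
    is an assumption of this sketch, not something the slides specify.
    """
    if not baseline or not candidate:
        raise ValueError("both timing lists must be non-empty")
    best_base = min(baseline)
    best_cand = min(candidate)
    if best_base <= 0:
        raise ValueError("timings must be positive")
    return (best_cand - best_base) / best_base


def is_slowdown(baseline, candidate, threshold=0.02):
    """True if the candidate is slower than the baseline by more than threshold."""
    return relative_slowdown(baseline, candidate) > threshold


# Hypothetical timings in microseconds for one expression.
old = [100.0, 101.5, 100.2, 103.0]
new = [103.0, 104.1, 103.2, 108.0]
print(is_slowdown(old, new))
```

A single summary number like this is exactly where false positives come from when timings are noisy, which is why the rest of the talk concentrates on repeatability.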

Challenges
A huge number of different cases to generate and test
Getting repeatable timings is extraordinarily difficult
  –Some timings are TINY e.g. 0+0
Huge volume of data to analyse

Huge Number of Cases
APL is our friend
PQA framework generates ~14,000 different APL expressions
~600 different variables are created for use by different expressions
Each expression is repeated for approximately 3-4 seconds
  –Currently split into 10 runs of <0.5 secs each
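A quick back-of-the-envelope check of what these numbers imply, written as a Python sketch. It assumes the expressions are timed strictly one after another and ignores set-up overhead; the function name is hypothetical.

```python
def collection_hours(n_expressions, seconds_per_expression):
    """Total hours if every expression is timed in sequence."""
    return n_expressions * seconds_per_expression / 3600.0


N_EXPRESSIONS = 14_000  # "~14,000" from the slide

low = collection_hours(N_EXPRESSIONS, 3.0)   # 3 seconds per expression
high = collection_hours(N_EXPRESSIONS, 4.0)  # 4 seconds per expression
print(f"{low:.1f} to {high:.1f} hours")
```

The result, roughly 11.7 to 15.6 hours, brackets the data-collection run-time of ~13 hours quoted on the Reporting slide.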

100 Expressions (selected at random)
+z2 ×i2 |l2 s4 ÷zn0 xs4+ys4 xi1×yb1 xl2|yi2 xi4 yl4 xi1<ys1 xi1ys1 xl0>yb0 xi0yl0 ¯11yd1 xa4 ya2 xb2 yi1 xb2 ys2 xs0 yl2 xs0ys0 xs1~ys0 xs4 yb4 xi2 yl0 xd0~yd1 xz1 yz1 bw4 +\iw2 \dw2 \bt1 lt4 -/bq2 \bq1 sq4 xb2.+yb1 10[0] bw1 10[0] zt0 ¯10[0] bt1,dv1 av2 lv4 iv1 at1,xq2,zq1 aq4 s xv4 s dv2 ¯10lv0 bv1,sv1 sv4,dv4 av4 lv4 xv0 av0 xv0 dv4 iv4zv0 lv0zv4 dv0xv0 dv0 iv0 dv4 bv4 s zv4 ¯1aw0 ¯1lw2 10lw2 11 ¯10sw2 ¯1 xw4 ¯10 lw1 sw2,iw1 zw2,lw2 xw0 dw4 xw4 sw4 bw4 lw0 iw4sw0 iw4 dw0 iw4 dw4 zw0 sw4 zw4xw4 zw4 sw4 s at2 10it2 ¯10dt4 11 ¯1at0 11 ¯1zt2 10 bt2 ¯10 bt4 zt0 it2 zt2 lt2 st0,lt0 at4 dt0 xt4 it0 st0it0 lt4zt4 11 1bq0 11 ¯10sq2 (j0 k) xv4 (k k4) zw4 bv2[j0]bv0 dv2[j0]dv0 paren10 { [ ]}iv4
(... and about 13,900 others)

Variables (~600)
left (x) or right (y) argument
datatype
  b - boolean, 1 bit
  s - short integer, 1 byte
  i - integer, 2 bytes
  l - long integer, 4 bytes
  d - double, 8 bytes
  z - complex, 2 doubles, 16 bytes
  f - DECF
  a - alphanumeric, 1-byte character
  x - enclosed (char vectors)
length, usually number of elements, but can be number of rows or number of columns
  0 - 1e0, vector or scalar or singleton
  1 - 1e1
  2 - 1e2
  4 - 1e4
kind of array
  v - vector (but *v0 is a scalar)
  t - tall matrix, 11-column matrix with 10*0 1 2 4 rows
  w - wide matrix, 11-row matrix with 10*0 1 2 4 columns
  q - square matrix with ⌈10*0 1 2 4×0.6 (1 4 16 252) rows/columns
domain (for variables used in scalar functions)
  n - non-zero
  p - positive and ~ 0 1
  u - unit circle; used in inverse trig functions
special
  c k - scalar indices ?6
  i - int vector of file com nos or native file indices
  j0 j1 j2 - index vectors of length 1e0 1e1 1e2
  k1 k2 k4 - index vectors of length 7 in the range 1e1 1e2 1e4
  bvc svc ivc lvc... - 11-element vectors of various types
Examples:
  zn0: complex non-zero scalar
  l4: 10,000 element long integer
  sp1: 10 element 1-byte ints >1
  d2: 100 element double
  bt2: 100x11 boolean matrix
  xw4: 11x10000 matrix of enclosed char vectors
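The naming scheme is regular enough to decode mechanically. The Python sketch below is one possible reading of it, not part of PQA: it assumes the letters appear in the order side, datatype, kind, domain, length, and it treats a leading x or y as the argument side only when a datatype letter follows (so xs4 is a left-argument short integer, while xw4 is an enclosed wide matrix). The "special" names (c, k, j0, k1, bvc, ...) are not handled. All identifiers in the code are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple

DATATYPES = {"b": "boolean", "s": "short integer", "i": "integer",
             "l": "long integer", "d": "double", "z": "complex",
             "f": "DECF", "a": "alphanumeric", "x": "enclosed"}
KINDS = {"v": "vector", "t": "tall matrix", "w": "wide matrix", "q": "square matrix"}
DOMAINS = {"n": "non-zero", "p": "positive", "u": "unit circle"}
SIDES = {"x": "left", "y": "right"}
SQUARE_SIDE = {0: 1, 1: 4, 2: 16, 4: 252}  # values quoted on the slide

# Assumed letter order: [side] datatype [kind] [domain] length
NAME = re.compile(r"([xy])?([bsildzfax])([vtwq])?([npu])?([0124])")


class Decoded(NamedTuple):
    side: Optional[str]
    datatype: str
    kind: Optional[str]
    domain: Optional[str]
    shape: Tuple[int, ...]


def decode(name: str) -> Decoded:
    """Decode a variable name such as 'bt2' under the assumptions above."""
    m = NAME.fullmatch(name)
    if m is None:
        raise ValueError(f"not a recognised variable name: {name!r}")
    side, dtype, kind, domain, size = m.groups()
    n = 10 ** int(size)
    if kind == "t":
        shape = (n, 11)          # tall: 11 columns
    elif kind == "w":
        shape = (11, n)          # wide: 11 rows
    elif kind == "q":
        shape = (SQUARE_SIDE[int(size)],) * 2
    elif size == "0":
        shape = ()               # length 0 is treated as a scalar here
    else:
        shape = (n,)
    return Decoded(SIDES.get(side), DATATYPES[dtype],
                   KINDS.get(kind), DOMAINS.get(domain), shape)


for example in ["zn0", "l4", "sp1", "d2", "bt2", "xw4", "xs4"]:
    print(example, decode(example))
```

The regular expression relies on backtracking to resolve the x ambiguity: for xw4 the optional side group first takes the x, fails because w is not a datatype letter, and then falls back to reading x as the datatype.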

Repeatable Timings...
Use a dedicated machine (real, NOT virtual!)
  –At Dyalog: 4 cores, 96 GB RAM, nothing installed except APL
Run processes at high or realtime priority
Pre-expand workspaces using 2000⌶
Control workspace compactions carefully
Carefully craft the execution loop to have minimum variable overhead

The Inner Loop (1/2)
  r←a TIMEX b;ai;cnt;n;rep;min;e;sum;kt;m
  cnt←0 ⍝ We will try 3 times
  :Repeat
      kt←GetPrivilegedProcessorTime ⍝ Will be checked at end
      cnt←cnt+1
      ai←⎕AI ⍝ Record CPU & Elapsed time
      {}⎕WA ⍝ Compact workspace
      pqa_cal_wait ⍝ Check time of calibration expression
      pqa_redef b ⍝ ensure args are in new pockets
      :If 0reppqa_REPS[pqa_I] r minpqa_TIME[pqa_I;1] eb ⍝ (use reps set in file to be compared with)
      :Else
          min1 /r10 timefx eb
          :If /mpqa_reps_EXPR.=(¯1 pqa_reps_EXPR)b
              reppqa_reps_REPS[m 1]
          :Else rep1 pqa_rep_ticks pqa_rep_ticks÷min ⍝ reps required to get 200 ticks (70 microsec)
          :EndIf
      :EndIf
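As far as the listing can be read, the loop takes the smallest of 10 single timings of the expression and derives from it the number of repetitions "required to get 200 ticks (70 microsec)". The Python sketch below illustrates that calibration idea only; it is not a translation of TIMEX. It works in nanoseconds rather than ticks, and the function names, the default of 10 tries and the clamping to at least one repetition are assumptions of the sketch.

```python
import math
import time


def reps_for_target(min_ns, target_ns=70_000):
    """Repetitions needed so that one timed batch lasts at least target_ns.

    min_ns is the fastest observed time for a single execution. Both the
    timing and the result are clamped to at least 1.
    """
    return max(1, math.ceil(target_ns / max(1, min_ns)))


def fastest_of(fn, tries=10):
    """Smallest elapsed time, in nanoseconds, over `tries` single calls of fn."""
    best = None
    for _ in range(tries):
        start = time.perf_counter_ns()
        fn()
        elapsed = time.perf_counter_ns() - start
        best = elapsed if best is None else min(best, elapsed)
    return best


def timed_batch(fn, reps):
    """Average nanoseconds per call when fn is run reps times in one batch."""
    start = time.perf_counter_ns()
    for _ in range(reps):
        fn()
    return (time.perf_counter_ns() - start) / reps


def work():
    # Hypothetical stand-in for the expression being timed.
    return sum(range(100))


reps = reps_for_target(fastest_of(work))
print(reps, timed_batch(work, reps))
```

Batching a tiny expression this way keeps the measured interval well above the timer resolution, which matters for cases such as 0+0 where a single execution is too short to time on its own.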

Recorded Data
The complete distribution of [several thousand] timings for each expression is recorded
  –The inner loop size for each expression is recorded and can be used as input to the next recording to create [more] comparable timings
Deciding what the data means is not easy...

Reporting
Providing useful reports on such a large quantity of data is a huge challenge.
Report needs to quickly identify bad (and good) news, without false positives.
  –A report with many false positives is worse than useless
Current run-time for data collection is ~13 hours, which makes the tool hard to use during development

The Hardest Part (for me)

A More Interesting Report

V14.0 With 3 Months To Go

Planned Work
Finalize report format and issue official 13.2 report – then 14.0
Hook reporting tool up to internal (MiServer-based) web server so entire development team can drill down and schedule runs
Create shorter test and web-based scheduler for ad hoc use by developers needing short turnaround to verify a change
Bring APLMON categories in line with PQA, so an APLMON profile can be combined with PQA data to predict performance (might work)
Holy Grail: Hook PQA up to overnight build system, so updates are blocked if a fix causes a degradation (and responsible developer fined for not running the test himself)

Credits
Most of the real work done by Roger Hui
  –(and most of the tuning, too!)
Morten is still working on getting reproducible/stable numbers and reporting
