CO405H Computing in Space with OpenSPL Topic 9: Programming DFEs (Loops II) Oskar Mencer Georgi Gaydadjiev special thanks: Jacob Bower Department of Computing.

Slides:



Advertisements
Similar presentations
Continuation of chapter 6…. Nested while loop A while loop used within another while loop is called nested while loop. Q. An illustration to generate.
Advertisements

WTM London 2012 "Is the industry doing enough to cater for people with disabilities?" David Stratton.
BC: Electron cryo microscopy in structural biology Ardan Patwardhan Dept. of Biological Sciences Imperial College November, 2003.
Introduction to Programming Java Lab 7: Loops 22 February JavaLab7 lecture slides.ppt Ping Brennan
1 March 2013Birkbeck College, U. London1 Introduction to Programming Lecturer: Steve Maybank Department of Computer Science and Information Systems
8 March 2013Birkbeck College, U. London1 Introduction to Programming Lecturer: Steve Maybank Department of Computer Science and Information Systems
Introduction to Programming
Introduction to Programming
About the e-learning course design intensives. Oxford Centre for Staff and Learning Development 2003 at Brookes.
Dependency Test in Loops By Amala Gandhi. Data Dependence Three types of data dependence: 1. Flow (True) dependence : read-after-write int a, b, c; a.
Lesson 6A – Loops By John B. Owen All rights reserved ©2011.
For(int i = 1; i
THE UNIVERSITY of EDINBURGH HEALTH and SAFETY DEPARTMENT Sources of guidance and assistance Health and Safety Office Fire Safety Unit Training & Audit.
Abbas Edalat Imperial College London Interval Derivative.
1 Variant Ownership with Existential Types Nicholas Cameron Sophia Drossopoulou Imperial College London.
void count_down (int count) { for(i=count; i>1; i--) printf(" %d\t", count); } printf("A%d\n", count); if(count>1) count_down(count-1); printf("B%d\n",
Algorithmic Foundations COMP108 COMP108 Algorithmic Foundations Searching Prudence Wong
Supervised by: Dr Sophia Drossopoulou, Dr Nobuko Yoshida Wildcards, Variance and Virtual Classes.
C++ Basics March 10th. A C++ program //if necessary include headers //#include void main() { //variable declaration //read values input from user //computation.
Artificial Intelligence 0. Course Overview Course V231 Department of Computing Imperial College, London © Simon Colton.
CO405H Computing in Space with OpenSPL Topic 10: Correlation Example Oskar Mencer Georgi Gaydadjiev Department of Computing Imperial College London
Numbers
Chapter 6 – Selected Design Topics Part 4 – Programmable Implementation Technologies Logic and Computer Design Fundamentals.
TAFTS instrument calibration Paul Green, Ralph Beeby and Juliet Pickering, Imperial College London CAVIAR field campaign meeting IC 29 th March 2011 Page.
CS 112 Intro to Computer Science II Sami Rollins Spring 2007.
1 10/9/06CS150 Introduction to Computer Science 1 for Loops.
CHAPTER 2 ANALYSIS OF ALGORITHMS Part 2. 2 Running time of Basic operations Basic operations do not depend on the size of input, their running time is.
AMUSE: Autonomic Management of Ubiquitous Systems for e-Health Dr. Emil C. Lupu Department of Computing Imperial College London, UK
Recursion Examples Fundamentals of CS Case 1: Code /* Recursion: Case 1 */ #include void count (int index); main () { count (0); getchar(); } void count.
 Platform Independent Petri net Editor 2 (PIPE2) CS2650 Distributed Multimedia Systems Wen Xu November 23 rd, 2010.
ICS 102 Computer Programming University of Hail College of Computer Science & Engineering Computer Science and Software Engineering Department.
computer
Repetition Statements while and do while loops
Name Services Clara Curman, Hanna Boman STS
CO405H Computing in Space with OpenSPL Topic 10: Correlation Example Oskar Mencer Georgi Gaydadjiev Department of Computing Imperial College London
11/10/2016CS150 Introduction to Computer Science 1 Last Time  We covered “for” loops.
April 11, 2005 More about Functions. 1.Is the following a function call or a function header? calcTotal(); 2.Is the following a function call or a function.
Source Page US:official&tbm=isch&tbnid=Mli6kxZ3HfiCRM:&imgrefurl=
Pointer Lecture 2 Course Name: High Level Programming Language Year : 2010.
亚洲的位置和范围 吉林省白城市洮北区教师进修学校 郑春艳. Q 宠宝贝神奇之旅 —— 亚洲 Q 宠快递 你在网上拍的一套物理实验器材到了。 Q 宠宝贝打电话给你: 你好,我是快递员,有你的邮件,你的收货地址上面 写的是学校地址,现在学校放假了,能把你家的具体 位置告诉我吗? 请向快递员描述自己家的详细位置!
基 督 再 來 (一). 經文: 1 你們心裡不要憂愁;你們信神,也當信我。 2 在我父的家裡有許多住處;若是沒有,我就早 已告訴你們了。我去原是為你們預備地去 。 3 我 若去為你們預備了地方,就必再來接你們到我那 裡去,我在 那裡,叫你們也在那裡, ] ( 約 14 : 1-3)
10-1 人生与责任 淮安工业园区实验学校 连芳芳 “ 自我介绍 ” “ 自我介绍 ” 儿童时期的我.
Путешествуй со мной и узнаешь, где я сегодня побывал.
CO405H Department of Computing Imperial College London
CO405H Department of Computing Imperial College London
IC6501 CONTROL SYSTEMS Class : III EEE / V SEMESTER
Think What will be the output?
Yahoo Mail Customer Support Number
Most Effective Techniques to Park your Manual Transmission Car
How do Power Car Windows Ensure Occupants Safety
Page 1. Page 2 Page 3 Page 4 Page 5 Page 6 Page 7.
The for-loop and Nested loops
Introduction to pseudocode
STUDENTS, WE VALUE YOUR INPUT!
Слайд-дәріс Қарағанды мемлекеттік техникалық университеті
.. -"""--..J '. / /I/I =---=-- -, _ --, _ = :;:.
THANK YOU!.
Transp Course 2014 Overview.
II //II // \ Others Q.
Thank you.
1 C PIPE"' · I - I. 0I0I .. ) sasa /.
I1I1 a 1·1,.,.,,I.,,I · I 1··n I J,-·
Thank you.
Principles of Computing – UFCFA Lecture-2
CS302 - Data Structures using C++
CS150 Introduction to Computer Science 1
Counter Integrated Circuits (I.C.s)
For Loops Pages
. '. '. I;.,, - - "!' - -·-·,Ii '.....,,......, -,
Presentation transcript:

CO405H Computing in Space with OpenSPL Topic 9: Programming DFEs (Loops II) Oskar Mencer Georgi Gaydadjiev special thanks: Jacob Bower Department of Computing Imperial College London CO405H course page: WebIDE: OpenSPL consortium page:

int count = 0; for (int i=0; i<N; ++i) { for (int j=0; j<M; ++j) { B[count] = A[count]+(i*M)+j; count += 1; } From Loops I: Loop Nest without Dependence 2 DFEVar A = io.input(”input”, dfeUInt(32)); CounterChain chain = control.count.makeCounterChain(); DFEVar i = chain.addCounter(N, 1).cast(dfeUInt(32)); DFEVar j = chain.addCounter(M, 1).cast(dfeUInt(32)); DFEVar B = A + i*100 + j; io.output(”output”, B, dfeUInt(32)); A A + + B B j j 100 i i + + * * Use a chain of counters to generate i and j

for (i = 0; ; i += 1) { float d = input[i]; float v = 2.91 – 2.0*d; for (iter=0; iter < 4; iter += 1) v = v * (2.0 - d * v); output[i] = v; } Loop Unrolling with Dependence 3 DFEVar d = io.input(”d”, dfeFloat(8, 24)); DFEVar TWO= constant.var(dfeFloat(8,24), 2.0); DFEVar v = constant.var(dfeFloat(8,24), 2.91) − TWO*d; for ( int iteration = 0; iteration < 4; iteration += 1) { v = v*(TWO− d*v); } io.output(”output”, v, dfeFloat(8, 24));

Classifying Loops – Attributes and measures Simple Fixed Length Stream Loops – Example vector add – Custom memory controllers Nested Loops – Counter chains – Streaming and unrolling – How to avoid cyclic graphs Variable Length Loops – Convert to fixed length Loops with data dependencies – DFESeqLoop: with a data parallel streaming loop – DFEParLoop: with a sequential streaming loop 4 Overview of Spatializing Loops

DFESeqLoop: Data Parallel Streaming Loop Instead of unrolling, create a sequential inner loop to save space => resource sharing DFESeqLoop lp1= new DFESeqLoop(this, “lp1”, 2); DFEVar in = io.input("in", dfeFloat(8, 24), lp1.itr1); lp1.set_input(in); DFEVar square = lp1.feedback * lp1.feedback; lp1.set_output(square); io.output(”square", square, dfeFloat(8, 24), lp1.done); for i=1 to N // data parallel loop in = stream_in[i]; for j=1 to 2 { // sequential loop lp1 square = in * in; in = square; // feedback } stream_out[i] = square; square in Assume 4 stages implement 1 multiply * lp1.feedback lp1.itr1 lp1.done

DFEParLoop: Sequential Streaming Loop Simple Accumulator Example (actually 4 concurrent accumulators) DFEParLoop lp2 = new DFEParLoop(this, “lp2”); DFEVar in = io.input("in", dfeFloat(8, 24), lp2.ndone); lp2.set_input(dfeFloat(8,24), 0.0); DFEVar result = in + lp2.feedback; lp2.set_output(result); io.output(”result", result, dfeFloat(8, 24), lp2.done); for j=1 to 4: out[j]=0.0; for i=1 to N: // sequential loop for j=1 to 4: // data parallel loop out[j] = out[j]+stream_in[i]; interleaved(4) stream_in Adder (Accumulator) assuming 4 pipeline stages Of course j could be a lot larger, but we do 4 at a time here since we assume 4 stages in a + interleaved(4) result lp2.feedback lp2.ndon e lp2.done CPU code, SAPI.h: get loop size: mget_loopLength() returns 4

7 class DFESeqLoop extends KernelLib { DFEVar itr1, done, feedback, feed_in; OffsetExpr loop; DFESeqLoop(Kernel owner, String loop_name, int loop_itrs) { super(owner); loop = stream.makeOffsetAutoLoop(loop_name); DFEVar pipe_len = loop.getDFEVar(this, dfeUInt(32)); DFEVar global_pos = control.count.simpleCounter(32); itr1 = global_pos < pipe_len; done = global_pos >= (pipe_len * loop_itrs); } void set_input(DFEVar loop_in) { feed_in = loop_in.getType().newInstance(this); feedback = itr1 ? feed_in : loop_in; // feed_in in the first iteration } void set_output(DFEVar result) { feed_in <== stream.offset(result, -loop); // connect the loop } DFESeqLoop lp1= new DFESeqLoop(this, “lp1”, 2); DFEVar in = io.input("in", dfeFloat(8, 24), lp1.itr1); lp1.set_input(in); DFEVar square = lp1.feedback * lp1.feedback; lp1.set_output(square); io.output(”square", square, dfeFloat(8, 24), lp1.done); DFESeqLoop details (optional)

8 class DFEParLoop extends KernelLib { DFEVar feed_in, pipe_len, global_pos, feedback, done, ndone; OffsetExpr loop; DFEParLoop(Kernel owner, String loop_name) { super(owner); loop = stream.makeOffsetAutoLoop(loop_name); pipe_len = loop.getDFEVar(this,dfeUInt(32)); // ParLoop iterates as long as there is data DFEVar stream_len = io.scalarInput(loop_name + "_len", dfeUInt(32)); global_pos = control.count.simpleCounter(32); done = global_pos >= (stream_len + pipe_len); ndone = global_pos < stream_len; } void set_input(DFEType fb_type, double init) { feed_in = fb_type.newInstance(this); DFEVar start_feedback = global_pos < pipe_len; feedback = start_feedback ? feed_in : init; } void set_output(DFEVar result) { feed_in <== stream.offset(result, -loop); // connect the loop } DFEParLoop lp2 = new DFEParLoop(this, “lp2”); DFEVar in = io.input("in", dfeFloat(8, 24), lp2.ndone); lp2.set_input(dfeFloat(8,24), 0.0); DFEVar result = lp2.feedback + in; lp2.set_output(result); io.output(”result", result, dfeFloat(8, 24), lp2.done); DFEParLoop details (optional)

Data Loops need Cyclic (Round Robin) interleaved Data Conversion can be done at runtime on the CPU in Software or on the DFE as Dataflow OR interleaving and de-interleaving can be pre-computed and stored in memory DFEVar streams interleave(4) de-interleave(4)

Multiple Loops: n-Body problem […] // all the below are interleaved data streams DFEVar rx = pjX - piX; DFEVar ry = pjY - piY; DFEVar rz = pjZ - piZ; DFEVar dd = rx*rx + ry*ry + rz*rz + scalars.EPS; DFEVar d = 1 / (dd * KernelMath.sqrt(dd)); DFEVar s = pjM * d; DFEParLoop lp = new DFEParLoop(this, “lp”); lp.set_inputs(3, dfeFloat(8,24), 0.0); DFEVar accX = lp.feedback[0] + rx*s; DFEVar accY = lp.feedback[1] + ry*s; DFEVar accZ = lp.feedback[2] + rz*s; lp.set_outputs(accX, accY, accZ); […]

Ok, so how does this really work… 11 int count = 0; for (int i=0; ; i += 1) { sum[i] = 0.0; for (int j=0; j<M; j += 1) { sum[i] = sum[i] + input[count]; count += 1; } output[i] = sum[i]; } DFEVar LoopCount = control.count.simpleCounter(32, M); DFEVar carry = scalarType.newInstance(this); DFEVar sum = LoopCount.eq(0) ? 0.0 : carry; sum = input + sum; carry.connect(stream.offset(sum, −13)); // feedback fifo buffer io.output(”output”, sum, scalarType, LoopCount.eq(M − 1)); -13

Pipeline Depth, Why The multiplexer has a pipeline depth of 1 The floating-point adder has a pipeline depth of 12 Total loop latency = 13 ticks, carry.connect(stream.offset(sum,−13)); luckily stream.makeOffsetAutoLoop() figures out the loop length for us. Now on the software side we need to interleave the input stream with a stride of 13. CPU call: get_loopLength() Generated by the compiler for every loop in the Kernel will return 13! See SAPI.h interface…

Output[0] input[0,0], input[0,1], input[0,2]... input[1,0], input[1,1], input[1,2]... input[2,0], input[2,1], input[2,2]... input[3,0], input[3,1], input[3,2]... Output[1] Output[2] Output[3] After an initial pipeline fill phase, all 13 pipeline stages are occupied 13 independent summations are computed in parallel 0.0 time

Pipeline Depth and Loop-Carry Distance 14 for (i=0;i<N;i++){ for (j=0;j<M;j++) v[i]=v[i]*v[i]+1; // distance 1 } Now the j-loop has a loop-carried dependency with distance 1, i.e. each loop needs the v[i] result of the previous loop, BUT the v[i]*v[i]+1 operations have X stages and thus take X clock cycles. Note that v[i]s are independent, i.e. the i-loop has no dependency!  We thus need X activities (v[i]s) to be in the loop at all times to fully utilize all stages of the multiplication pipeline. for (i=0;i<N/X;i++){ for (j=0;j<M;j++) for (k=0;k<X;k++) // distance X v[i*X+k]=v[i*X+k]*v[i*X+k]+1; } X

15 Computing with 2D Arrays: Loop interchange Example: Row-wise summation is serial due to chain of dependence Column-wise summation would be easy of course So we can keep the pipeline in a cyclic data datapath full by flipping the problem – ie by interchanging the loops

Loop Tiling reduces FMEM requirement Idea: sum a block of rows at a time (“tiling”) We can choose the tile size Just big enough to fill the pipeline so no unnecessary buffering is needed c is the length of the feedback loop, depending on the number format for the accumulator! 16

Loop Tiling reduces FMEM requirement What if we need a particular loop length because of the particular size of our matrix? We can set loopLength to any number larger than the minimum LoopLength: 17 DFEParLoop lp2 = new DFEParLoop(this, “lp2”); DFEVar in = io.input("in", dfeFloat(8, 24), lp2.ndone); lp2.set_input(dfeFloat(8,24), 0.0); DFEVar result = lp2.feedback + in; lp2.set_output(result, 16 ); // set loopLength to 16 io.output(”result", result, dfeFloat(8, 24), lp2.done); However, the larger the loopLength, the more resources are needed globally. Therefore, for maximal efficiency, loopLength should be as small as possible…

Summary: Feedback Loops for Computing in Space If an unrolled loop does not fit => DFESeqLoop For a loop with a loop-carried data dependence which cannot be unrolled, we need to create a loop in the data flow graph => DFEParLoop Interchanging loops and reorganising computations can reduce resource requirements Splitting loops into blocks (“tiling”) allows us to control the amount of buffering required 18