
1 University at Albany, SUNY (lrm, 6/28/2016)
Levels of Processor/Memory Hierarchy Can Be Modeled by Increasing the Dimensionality of the Data Array
–One additional dimension for each level of the hierarchy.
–Envision the data as reshaped to reflect the increased dimensionality.
–The calculus automatically transforms the algorithm to reflect the reshaped data array.
–Data layout, data movement, and scalarization are generated automatically from the reshaped data array.

2 Levels of Processor/Memory Hierarchy (continued)
Approach: Mathematics of Arrays.
–Math and indexing operations appear in the same expression.
–Provides a framework for design-space search: rigorous, provably correct, and extensible to complex architectures.
Example: "raising" array dimensionality for y = conv(h, x).
(Figure: x mapped across the memory hierarchy, main memory through L2 and L1 cache, and across processors P0, P1, P2.)

3 Application Domain
Signal processing: 3-d radar data processing (pulse compression, Doppler filtering, beamforming, detection), built from convolution and matrix multiply.
–Composition of monolithic array operations.
–Both the algorithm and the architectural information (memory, processor) are inputs.
–Change the algorithm to better match hardware, memory, and communication by lifting the dimension algebraically:
model processors (dim = dim + 1); model time-variance (dim = dim + 1); model Level 1 cache (dim = dim + 1); model all three: dim = dim + 3.

4 Some Psi Calculus Operations

5 Convolution: Psi Calculus Description
Psi Calculus operators compose to form higher-level operations.
Definition of y = conv(h, x):
y[n] = Σ_{k=0}^{M−1} h[k] · x'[n + M − 1 − k]
where x has N elements, h has M elements, 0 ≤ n < N+M−1, and x' is x padded by M−1 zeros on either end.
Algorithm and Psi Calculus description (algorithm step → Psi Calculus expression):
–Form x': x' = cat(reshape(<M−1>, <0>), cat(x, reshape(<M−1>, <0>)))
–Rotate x' (N+M−1) times: x'_rot = binaryOmega(rotate, 0, iota(N+M−1), 1, x')
–Take the size-of-h part of x'_rot: x'_final = binaryOmega(take, 0, reshape(<N+M−1>, <M>), 1, x'_rot)
–Multiply: Prod = binaryOmega(*, 1, h, 1, x'_final)
–Sum: Y = unaryOmega(sum, 1, Prod)
(The slide's worked numeric example for x, h, x', x'_rot, x'_final, Prod, and Y did not survive the transcript.)

6 Experimental Platform and Method
Hardware: DY4 CHAMP-AV board
–Contains 4 MPC7400s and 1 MPC8420.
–MPC7400 (G4): 450 MHz, 32 KB L1 data cache, 2 MB L2 cache, 64 MB memory per processor.
Software
–VxWorks 5.2 real-time OS.
–GCC 2.95.4 (non-official release): GCC 2.95.3 with patches for VxWorks; optimization flags: -O3 -funroll-loops -fstrict-aliasing.
Method
–Run many iterations and report average, minimum, and maximum times: from 10,000,000 iterations for small data sizes down to 1,000 for large data sizes.
–All approaches run on the same data; only average times are shown here; only one G4 processor is used.
–Use of the VxWorks OS resulted in very low variability in timing, giving a high degree of confidence in the results.

7 Convolution and Dimension Lifting
Model the processor and the Level 1 cache:
–Start with 1-d inputs (the input dimension).
–Envision a 2nd dimension ranging over output values.
–Envision processors reshaped into a 3rd dimension; the 2nd dimension is partitioned.
–Envision cache reshaped into a 4th dimension; the 1st dimension is partitioned.
–"psi"-reduce to normal form.

8 Envision a 2nd dimension ranging over output values.
Let tz = N+M−1, where M is the size of h (here 4) and N is the size of x.
(Figure: a tz × 4 array whose rows pair the reversed filter h3, h2, h1, h0 against successive windows of the zero-padded x, starting from 0 0 0 x0.)

9 Envision processors reshaped into a 3rd dimension; the 2nd dimension is partitioned.
Let p = 2.
(Figure: the tz × 4 array is reshaped to 2 × (tz/2) × 4, so each processor owns tz/2 of the output rows.)

10 Envision cache reshaped into a 4th dimension; the 1st dimension is partitioned.
(Figure: each (tz/2) × 4 block is further split along the length-4 filter dimension into cache-sized pieces, giving a 2 × (tz/2) × 2 × 2 array.)

11 ONF for the Convolution Decomposition with Processors and Cache
Generic form: 4-dimensional after "psi" reduction (time domain).
Let tz = N+M−1, where M is the size of h and N is the size of x. sum is calculated for each element of y.
1. For i0 = 0 to p−1 do:                       (processor loop)
2.   For i1 = 0 to tz/p − 1 do:                (time loop)
3.     sum ← 0
4.     For icacherow = 0 to M/cache − 1 do:    (cache loop)
5.       For i3 = 0 to cache − 1 do:
6.         sum ← sum + h[(M − ((icacherow · cache) + i3)) − 1] · x'[((tz/p) · i0 + i1) + (icacherow · cache) + i3]

12 Outline
–Overview
–Array algebra (MoA) and index calculus (Psi Calculus)
–Time-domain convolution
–Other algorithms in radar:
Modified Gram-Schmidt QR decomposition (MoA to ONF, experiments);
composition of matrix multiplication in beamforming (MoA to DNF, experiments);
FFT
–Benefits of using MoA and Psi Calculus

13 Algorithms in Radar
Three algorithms: time-domain convolution, Modified Gram-Schmidt QR (A), and A × (B^H × C) beamforming.
(Figure: workflow. Manual description and derivation of the DNF for 1 processor; lift dimension for the processor and L1 cache and reformulate DNF → ONF; mechanize using expression templates; implement the DNF/ONF in Fortran 90; compare compiler optimizations of DNF to ONF; benchmark at NCSA with LAPACK; use to reason about RAW; thoughts on an abstract machine, MoA and ψ Calculus.)

14 Benefits of Using MoA and Psi Calculus
–The processor/memory hierarchy can be modeled by reshaping data, using an extra dimension for each level.
–A composition of monolithic operations can be re-expressed as a composition of operations on smaller data granularities, matching the memory-hierarchy levels and avoiding materialization of intermediate arrays.
–The algorithm can be automatically (algebraically) transformed to reflect the array reshapings above.
–Programming is expressed at a high level, which facilitates intentional program design, analysis, and portability.
–This approach is applicable to many other problems in radar.

15 ONF for the QR Decomposition with Processors and Cache: Modified Gram-Schmidt
(Figure: loop structure. A main loop contains processor and cache loops for each phase: initialization, compute norm, normalize, dot product, orthogonalize.)

16 DNF for the Composition of A × (B^H × C): Beamforming
Generic form: 4-dimensional. Given n-by-n arrays A, B, X, Z:
1. Z = 0
2. For i = 0 to n−1 do:
3.   For j = 0 to n−1 do:
4.     For k = 0 to n−1 do:
5.       z[k;] ← z[k;] + A[k;j] × X[j;i] × B[i;]

17 Typical C++ Operator Overloading
Example: A = B + C (vector add)
1. Pass B and C by reference to operator+.
2. Create a temporary result vector.
3. Calculate the results and store them in the temporary.
4. Return a copy of the temporary.
5. Pass the result by reference to operator=.
6. Perform the assignment.
Two temporary vectors are created, causing:
–Additional memory use: static memory and dynamic memory (which also affects execution time).
–Additional execution time: cache misses/page faults, time to create a new vector, time to create a copy of a vector, and time to destruct both temporaries.

18 C++ Expression Templates and PETE
With expression templates, parse trees, not vectors, are created: A = B + C builds a parse-tree type of the form BinaryNode<OpAdd, Reference<Vector>, Reference<Vector>>.
1. Pass B and C by reference to operator+.
2. Create the expression parse tree (+ with children B& and C&).
3. Return the expression parse tree.
4. Pass the expression-tree reference to operator=.
5. Calculate the result and perform the assignment.
Benefits:
–Reduced memory use: the parse tree contains only references.
–Reduced execution time: better cache use, loop-fusion-style optimization, and compile-time expression-tree manipulation.
PETE, the Portable Expression Template Engine (http://www.acl.lanl.gov/pete), is available from the Advanced Computing Laboratory at Los Alamos National Laboratory. PETE provides expression-template capability and facilities to help navigate and evaluate parse trees.

19 Implementing Psi Calculus with Expression Templates
Example: A = take(4, drop(3, rev(B))) for a 10-element B.
Recall: Psi reduction for 1-d arrays always yields one or more expressions of the form x[i] = y[stride·i + offset], l ≤ i < u.
1. Form the expression tree: take(4, drop(3, rev(B))).
2. Add size information: size(B) = 10, size(rev) = 10, size(drop) = 7, size(take) = 4.
3. Apply the Psi reduction rules, from the leaf up:
size=10: A[i] = B[i]
size=10: A[i] = B[−i + B.size − 1] = B[−i + 9]
size=7:  A[i] = B[−(i + 3) + 9] = B[−i + 6]
size=4:  A[i] = B[−i + 6]
4. Rewrite as sub-expressions with iterators at the leaves (offset = 6, stride = −1) and loop-bounds information at the root.
Iterators are used for efficiency, rather than recalculating indices for each i; one "for" loop evaluates each sub-expression.

