Aarul Jain CSE520, Advanced Computer Architecture Fall 2007.


1 Aarul Jain CSE520, Advanced Computer Architecture Fall 2007

2  Three versions of the Fast Fourier Transform, implemented on the Cell BE simulator and analyzed as the order of the FFT is increased:
◦ FFT on the PPE / a single SPU.
◦ Data/task parallel on multiple SPUs (single-buffer vs. double-buffer performance comparison).
◦ Pipelined implementation on multiple SPUs.
 Performance measured for:
◦ the FFT kernel
◦ DMA data transfer

3 PPE  64bit Power architecture with VMX.  In-order, 2-way SMT.  32KB L1, 512KB L2 Cache. SPE  256 KB local store.  In-order, No speculation.  128 registers for all data types. EIB  Four 16B data rings.  Over 100 outstanding requests.

4  FFT compute intensity: O(n log n).
 Implementation on the PPU:
◦ Cache-based memory architecture; no software-controlled memory.
 Implementation on the SPU:
◦ Software-controlled memory.
◦ The limited local store determines the maximum FFT size that can be implemented (data structure size = 16 bytes * FFT size => 8K-point FFT).
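The sizing argument above can be sketched as a small helper. The 16 bytes/point figure comes from the slide's four float arrays (RealIn, ImagIn, RealOut, ImagOut); the 128 KB data budget, i.e. half of the 256 KB local store with the rest left for code, stack, and twiddle factors, is an assumption chosen to reproduce the 8K-point limit, not a number from the slides.

```c
#include <stddef.h>

/* Bytes per FFT point: four float arrays (RealIn, ImagIn, RealOut,
   ImagOut) at 4 bytes each, matching the slide's
   "Data Structure Size = 16 bytes * FFT size". */
#define BYTES_PER_POINT 16

/* Largest power-of-two FFT size whose data fits in `budget` bytes
   of SPU local store. */
static size_t max_fft_points(size_t budget)
{
    size_t n = 1;
    while (2 * n <= budget / BYTES_PER_POINT)
        n *= 2;
    return n;
}
```

With a 128 KB data budget this yields 8192 points, the 8K limit above; halving the budget (two buffers for double buffering) yields 4096, matching slide 7.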

5 PPE AND SPE EXECUTION TIMES (cycles)

  N          N log N    PPE cycles   SPE FFT-only    SPE DMA       Total cycles   Thread-creation   Difference
  (points)              (on PPE)     cycles (on SPE) cycles (SPE)  (on PPE)       cycles (on PPE)   (total - creation)
  1024       10240      6864         5358            285           48171          25427             22744
  2048       22528      14355        11511           277           53452          24892             28560
  4096       49152      31357        24689           625           68145          24935             43210
  8192       106496     68382        52793           1041          96040          25244             70796
  16384      229376     149032       -               -             -              -                 -
  32768      491520     551657       -               -             -              -                 -

(Chart: N vs. cycles)

6  The number of cycles on the PPU and SPU scales with N log N.
 Compute time on the PPU is greater than on a single SPU because of cache misses on the PPU; the SPU has no cache and accesses its local store directly.
 DMA is very efficient.
 Thread creation on the SPE is very expensive, so SPUs need to be dedicated to a particular task long enough to recoup the setup time.
 DIFFERENCE (col 8) too large? Exact reason unknown. Possible causes:
◦ Cycles spent exiting the thread. (Are local-store entries invalidated on exit?)
◦ Profiling-tool problem. (IBM states the simulator is used for profiling SPEs, not PPEs. Does this mean the intrinsics provided for measuring cycles on the PPE (__mftb) are inaccurate?)

7  Multiple FFTs run concurrently, each SPU working on different data.
 Local store limits the FFT size:
◦ Single-buffer approach => 8K points.
◦ Double-buffer approach => 4K points.
 Single buffer vs. double buffer.
 Performance as the number of active SPUs is increased.
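The double-buffer schedule can be sketched on a host machine. Here memcpy stands in for mfc_get/mfc_put and process() is a stand-in compute kernel, so all names and sizes are illustrative rather than the project's code; on the SPU the point is that the DMA for block i+1 runs asynchronously while block i is being computed, which is what a synchronous memcpy cannot show.

```c
#include <string.h>

#define NBLOCKS 8     /* illustrative block count */
#define BLK     1024  /* illustrative block size, in floats */

/* Stand-in compute kernel: double every element. */
static void process(float *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] *= 2.0f;
}

/* Double-buffer schedule: fetch block i+1 into the idle buffer,
   compute on block i, write block i back, then swap buffers. */
static void double_buffered(const float *src, float *dst)
{
    static float buf[2][BLK];
    int cur = 0;

    memcpy(buf[cur], src, BLK * sizeof(float));      /* prime buffer 0 */
    for (int i = 0; i < NBLOCKS; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < NBLOCKS)                         /* "DMA in" next block */
            memcpy(buf[nxt], src + (i + 1) * BLK, BLK * sizeof(float));
        process(buf[cur], BLK);                      /* compute current block */
        memcpy(dst + i * BLK, buf[cur], BLK * sizeof(float)); /* "DMA out" */
        cur = nxt;
    }
}
```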

8 SINGLE BUFFER vs. DOUBLE BUFFER (cycles)

                 ------- SINGLE BUFFER -------       ------- DOUBLE BUFFER -------
  SPUs   N       Thread     Avg. FFT   Avg. DMA      Thread     Avg. FFT   Avg. DMA
                 creation                            creation              (approx.)
  1      1024    25427      5358       142.5         25400      5358       37
  2      1024    46351      5358       113.75        46247      5358       36
  4      1024    88624      5358       104.625       91772      5358       43
  8      1024    201305     5358       114.375       207738     5358       55
  1      2048    24892      11511      138.5         24441      11511      36
  2      2048    45841      11511      105           45016      11511      35
  4      2048    88995      11511      94.375        87906      11511      42
  8      2048    197680     11511      102           173772     11511      53
  1      4096    24935      24689      312.5         25351      24689      41
  2      4096    46655      24689      294.75        46369      24689      40
  4      4096    87753      24689      273.5         89223      24689      47
  8      4096    176826     24689      279.813       177459     24689      51
  1      8192    25244      52793      520.5         -          -          -
  2      8192    45899      52793      508.5         -          -          -
  4      8192    89137      52793      498.5         -          -          -
  8      8192    185690     52793      502.3         -          -          -

(8K points exceed the double-buffer local-store limit, so those columns are empty.)

9  Multiple processors give more compute power:
◦ For FFT, almost 8x if thread creation is not counted.
 Double buffering may not always give a speed advantage (Amdahl's law).
 Analyze the algorithm carefully to determine whether it is compute-intensive or memory-intensive with respect to the Cell architecture.
◦ Matrix multiplication is memory-intensive, but FFT becomes memory-intensive only at very large orders, where the samples can no longer fit in the Cell local store.
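The "almost 8x" and Amdahl's-law points can be made concrete: with serial fraction s, the speedup on p SPUs is bounded by 1/(s + (1-s)/p). The serial fractions used below are illustrative numbers, not measurements from the slides.

```c
/* Amdahl's law: upper bound on speedup with p processors when a
   fraction s of the work is serial (e.g. thread creation, or DMA
   time that double buffering cannot hide). */
static double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}
```

With s = 0 the bound is exactly p (the "almost 8x" case); even s = 0.1 caps 8 SPUs below 5x, which is why overlapping DMA with compute only pays off while the transfer is actually on the critical path.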

10  Reference: http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/0AA2394A505EF0FB872570AB005BF0F1

  No. of cycles for a single 4K-point FFT = 24688
  No. of floating-point operations = 4*1024 * log2(4*1024) = 49152
  System frequency = 3.2 GHz
  No. of SPUs = 8
  GFLOPS = (49152 / 24688) * 8 * 3.2G = 50.96 GFLOPS

(Chart: IBM results vs. my results)
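The GFLOPS arithmetic above, written as a function. Note the operation count follows the slide's N log2 N convention; FFT flops are often counted as 5 N log2 N instead, so this is a like-for-like reproduction of the slide's figure, not an absolute throughput claim.

```c
/* Throughput in GFLOPS: flops-per-FFT / cycles-per-FFT gives flops
   per cycle on one SPU; multiply by the SPU count and the clock in
   GHz (cycles per second, in units of 1e9). */
static double fft_gflops(double flops_per_fft, double cycles_per_fft,
                         int num_spus, double clock_ghz)
{
    return flops_per_fft / cycles_per_fft * num_spus * clock_ghz;
}
```

fft_gflops(49152, 24688, 8, 3.2) evaluates to about 50.97, reproducing the slide's 50.96 GFLOPS (49152 = 4096 * log2(4096)).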

11  The Cell architecture and its programming environment are completely new; unknown problems come up.
 Runtime "bus error": normally caused by unaligned accesses; in my case, by DMA accesses larger than 16 KB.
 Profiling is tricky because the simulator supports multiple modes. Assembly intrinsics are required to measure actual cycles, and running in "cycle" mode is very slow:
◦ An 8K-point FFT takes two days to run.
 The simulator crashes when the mode is changed multiple times.
 Debug support is very complex.

12  Use the alphaWorks forum: an excellent forum with quick response times.
 To profile accurately, run the simulation in cycle mode.
 Intrinsics for profiling:
◦ __mftb() -> on the PPE
◦ spu_writech(), spu_readch() -> on the SPE

13  Pipelined implementation of FFT.
 Standalone mode.
 Higher-order FFTs.
 Compiler performance.

14  http://www.ibm.com/developerworks/forums/thread.jspa?threadID=160216
 Cell Broadband Engine Architecture Reference Manual, Ver. 1.02, October 11, 2007.
 IBM Cell Broadband Engine Software Development Kit, http://alphaworks.ibm.com/tech/cellsw?open&S_TACT=105AGX16&S_CMP=DWPA
 Kahle, J. A., et al., "Introduction to the Cell Multiprocessor," IBM Journal of Research and Development, September 2005.
 Perrone, M., "Introduction to the Cell Processor" (lecture), http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf
 Krewell, K., "Cell Moves Into the Limelight," Microprocessor Report, February 2005.
 Krewell, K., "Chips, Software, and Systems," Microprocessor Report, January 2005.
 http://www-128.ibm.com/developerworks/forums/thread.jspa?threadID=182042

15

16 The SPU double-buffer loop: cb1 uses DMA tag x, cb2 uses tag y+10, and each control block is moved in chunks of CHUNK bytes because a single DMA transfer is limited to 16 KB. (The (char *) casts fix the pointer arithmetic: &cb1 + k would advance by k * sizeof(cb1), not k bytes.)

    #define CHUNK (sizeof(cb1) / (FFT_SIZE / 1024))

    loop (
        /* start fetching the next cb1 chunk (tag x) */
        mfc_get((char *)&cb1 + x * CHUNK, argp + x * CHUNK, CHUNK, x, 0, 0);

        /* wait for cb2's previous transfer, then fetch the next cb2 chunk */
        mfc_write_tag_mask(1 << (y + 10));
        mfc_read_tag_status_all();
        mfc_get((char *)&cb2 + y * CHUNK, argp + y * CHUNK, CHUNK, y + 10, 0, 0);

        /* wait for cb1's data, then run the FFT on it */
        mfc_write_tag_mask(1 << x);
        mfc_read_tag_status_all();
        fft_float(FFT_SIZE, cb1.RealIn, cb1.ImagIn, cb1.RealOut, cb1.ImagOut);

        /* wait for cb2's data, then start writing cb1's results back */
        mfc_write_tag_mask(1 << (y + 10));
        mfc_read_tag_status_all();
        mfc_put((char *)&cb1 + x * CHUNK, argp + x * CHUNK, CHUNK, x, 0, 0);

        /* run the FFT on cb2 while cb1's put is in flight */
        fft_float(FFT_SIZE, cb2.RealIn, cb2.ImagIn, cb2.RealOut, cb2.ImagOut);

        /* wait for cb1's put, then write cb2's results back */
        mfc_write_tag_mask(1 << x);
        mfc_read_tag_status_all();
        mfc_put((char *)&cb2 + y * CHUNK, argp + y * CHUNK, CHUNK, y + 10, 0, 0);
    )
    /* drain the final cb2 transfer */
    mfc_write_tag_mask(1 << (y + 10));
    mfc_read_tag_status_all();

17 A single DMA of the whole control block:

    mfc_get(&cb1, argp, sizeof(cb1), x, 0, 0);

WON'T WORK when sizeof(cb1) > 16 KB (the DMA transfer-size limit). It should be recoded to move the data in chunks:

    for (x = 0; x < FFT_SIZE / 1024; x++) {
        mfc_get((char *)&cb1 + x * sizeof(cb1) / (FFT_SIZE / 1024),
                argp + x * sizeof(cb1) / (FFT_SIZE / 1024),
                sizeof(cb1) / (FFT_SIZE / 1024), x, 0, 0);
    }

