GPU DAS CSIRO ASTRONOMY AND SPACE SCIENCE Chris Phillips 23rd October 2012



VLBI DAS – What is needed
- Convert the analog signal to digital
- Format the data, including a time code
In the age of disk recording and software correlation, anything else is "optional".
1st International VLBI Technology Workshop: GPU DAS | Chris Phillips

VLBI DAS – What is desirable
- Channelization (polyphase filterbank)
- Re-quantize data (reduced data rate)
- Tsys extraction
- Digital linear-to-circular conversion
- Plus lots more – phasecal extraction etc.

"Traditional DAS"
- High-speed (multi-bit) sampler and FPGA for processing (RDBE, DBBC etc.)
- FPGA: high compute capability, low power usage, compact
- Ideal for VLBI usage
But…

GPU Processing
- GPUs are extremely fast and cheap
- Nvidia GTX 690: $ Tflops (theoretical)

GPU Arithmetic Performance

GPU Power Efficiency

GPU Memory Bandwidth


GTX 590 vs GTX 690

             GTX 590 (per GPU)   GTX 690 (per GPU)
CUDA cores   512                 1536
L2 cache     768 kB
registers/block (SM)
global memory  1536 Mbytes       2048 Mbytes
GPUs per card  2                 2

CUDA
- A parallel computing architecture developed by NVIDIA
- Extensions to C(++) that allow massive parallelization running on a GPU
- Runs thousands of threads in parallel
- Very simple to program

__global__ void multiply_kernel(float *a, const float *b, const float *c)
{
    const size_t i = blockDim.x * blockIdx.x + threadIdx.x;
    a[i] = b[i] * c[i];
}

GPU DAS Experiment
Goal: 1 GHz dual pol, 8 → 2 bits
- Assume dual pol sampled at 8-bit real
- Channelize to 16 MHz
- Linear to circular conversion (Tsys)
- Convert to 2 bit
- Pack (interleave)

Steps Required
- Copy data to GPU
- Convert 8 bit to float
- De-interleave data
- FFT (eventually polyphase filterbank)
- Complex gain correction
- Linear to circular
- Calculate RMS
- Quantization and interleave
Testing on GTX 590 ($500)

I/O Bandwidth
- 1 GHz real, 8 bits, dual pol → 32 Gbps bandwidth
- Test: Host → GPU 47 Gbps, GPU → Host 50 Gbps
- CUDA allows DMA transfer while processing
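The quoted input rate follows directly from the sampling arithmetic; a minimal sketch (function name illustrative):

```cpp
// Data rate for Nyquist-sampled real voltage data.
// 1 GHz bandwidth -> 2 Gsamples/s; at 8 bits/sample and 2 polarizations
// this gives the 32 Gbps quoted above.
double data_rate_gbps(double bandwidth_hz, int bits_per_sample, int npol) {
    double sample_rate = 2.0 * bandwidth_hz;  // Nyquist rate for real sampling
    return sample_rate * bits_per_sample * npol / 1e9;
}
```

After requantization to 2 bits the same arithmetic gives an 8 Gbps output rate.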

Convert to Float – 142 ms

__global__ void unpack8bit_2chan_kernel(float *dest, const int8_t *src, int N)
{
    const size_t i = blockDim.x * blockIdx.x + threadIdx.x;
    const size_t j = i * 2;
    dest[i]     = static_cast<float>(src[j]);
    dest[i + N] = static_cast<float>(src[j + 1]);
}
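For reference, a plain CPU version of the same unpack/de-interleave step (a sketch; the two-channel interleaved layout is inferred from the kernel's indexing):

```cpp
#include <cstdint>

// De-interleave N sample pairs of 8-bit data into two contiguous
// float arrays: channel 0 in dest[0..N-1], channel 1 in dest[N..2N-1].
void unpack8bit_2chan_ref(float *dest, const int8_t *src, int N) {
    for (int i = 0; i < N; i++) {
        dest[i]     = static_cast<float>(src[2 * i]);
        dest[i + N] = static_cast<float>(src[2 * i + 1]);
    }
}
```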

FFT – 465 ms

cufftPlan1d(&plan, 32, CUFFT_R2C, batch);
cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(plan, in, out);

Linear to Circular – 254 ms

__global__ void linear2circular_kernel(cuFloatComplex *data, int nchan, int N, cuFloatComplex *gain)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int c = i % nchan;
    cuFloatComplex temp;
    // Apply the per-channel complex gain to each polarization
    data[i]     = cuCmulf(data[i], gain[c]);
    data[i + N] = cuCmulf(data[i + N], gain[c + nchan]);
    // The circular polarizations are the difference and sum of the
    // gain-corrected linear polarizations
    temp        = cuCsubf(data[i], data[i + N]);
    data[i + N] = cuCaddf(data[i], data[i + N]);
    data[i]     = temp;
}
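The same arithmetic on the CPU with std::complex, to make the conversion explicit (a sketch; the 90° phase offset between polarizations is assumed to be absorbed into the complex gains):

```cpp
#include <complex>

using cf = std::complex<float>;

// Apply per-polarization complex gains, then form the difference and
// sum of the two linear polarizations, as in the kernel above.
void linear2circular(cf &x, cf &y, cf gainx, cf gainy) {
    x *= gainx;
    y *= gainy;
    cf diff = x - y;
    y = x + y;    // one circular polarization
    x = diff;     // the other circular polarization
}
```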

RMS – 115 msec
- A "reduction", so potentially problematic to parallelize
- Took the SDK sample for mean calculation
- Assumed real sampled data
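What the reduction computes is simple; a serial sketch for real sampled data:

```cpp
#include <cmath>
#include <vector>

// RMS of zero-mean real samples: square root of the mean of the squares.
// On the GPU the sum of squares is a parallel reduction, which is why
// this step needs more care than the purely element-wise kernels.
float rms(const std::vector<float> &x) {
    double sumsq = 0.0;
    for (float v : x) sumsq += static_cast<double>(v) * v;
    return static_cast<float>(std::sqrt(sumsq / x.size()));
}
```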

Convert to 2 bit and interleave
- Just use an if/then/else block with 3 thresholds
- Specially coded for the 32-channel dual-pol case
- 5.8 seconds!!!

Convert to 2 bit and interleave, try 2
- Just use an if/then/else block with 3 thresholds
- Used the simple dual-pol case
- 247 msec
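A serial sketch of the 3-threshold quantizer and the packing step (the ±0.98×RMS threshold is the classic optimum for 4-level sampling, not a value given in the talk; the level encoding is illustrative):

```cpp
#include <cstdint>

// Map a float sample to one of four levels using three thresholds
// at -t, 0 and +t, where t is derived from the measured RMS.
uint8_t quantize2bit(float x, float rms) {
    const float t = 0.981f * rms;    // assumed optimal 4-level threshold
    if (x < -t)        return 0;     // strong negative
    else if (x < 0.0f) return 1;     // weak negative
    else if (x < t)    return 2;     // weak positive
    else               return 3;     // strong positive
}

// Pack four consecutive 2-bit samples into one output byte.
uint8_t pack4(const float *x, float rms) {
    uint8_t b = 0;
    for (int i = 0; i < 4; i++)
        b |= quantize2bit(x[i], rms) << (2 * i);
    return b;
}
```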

Summary

Step                         Time (msec)
Transfer to GPU              OK
Unpack and convert to float  142
FFT                          465
RMS                          115
Circular to Linear           253
Quantize and pack            247
Total                        1222

But 2 GPUs, so OK.

Polyphase filterbank?
- Tests used FFT channelization; a PFB is preferred
- Conventional wisdom says a PFB is 1.5-2x the compute of an FFT
- Assuming 2x the FFT time and a dual GPU gives a total compute time of 0.84 seconds
- 320 msec left to implement Tsys extraction if the simple RMS calculation is not enough
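The extra cost over a plain FFT comes from the tap-weighted front end; a serial sketch (tap count and names illustrative; the FFT stage is omitted):

```cpp
#include <vector>

// Polyphase filterbank front end: each nchan-point block handed to the
// FFT is a weighted sum of ntap consecutive input blocks, with the
// prototype filter h (length ntap*nchan) supplying the weights.
std::vector<float> pfb_frontend(const std::vector<float> &x,
                                const std::vector<float> &h,
                                int nchan, int ntap) {
    std::vector<float> y(nchan, 0.0f);
    for (int t = 0; t < ntap; t++)
        for (int n = 0; n < nchan; n++)
            y[n] += h[t * nchan + n] * x[t * nchan + n];
    return y;  // an nchan-point FFT of y completes the filterbank
}
```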

CPU Comparison?

Step                         GTX 590      2x Intel 2.40 GHz
Unpack and convert to float
FFT (PFB)                    465 (930)    1469 (2939)
RMS
Circular to Linear
Quantize and pack
Total (sec)                  0.61 (0.84)  3.7 (5.2)

GPU Comparison

Step                 GTX 590  GTX 690  GTX 670*  GTX 480
Unpack
FFT
RMS
Circular to Linear
Pack
Total (sec)

*Single GPU, rest are dual GPU

Conclusion
- A modern GPU should have enough compute power to cover most VLBI DAS requirements
- A lot of scope for speed improvements, e.g. combining some stages together
- A working system would allow a very generic backend: VLBI, spectral line, pulsar, fast transient detector…

CSIRO Astronomy and Space Science
www.atnf.csiro.au
Thank you