GPU-accelerated SDR Implementation of Multi-User Detector for Satellite Return Links
Chen Tang, Institute of Communication and Navigation, German Aerospace Center (DLR)
Sino-German Workshop, 03.2014

Overview
- Introduction and Motivation
- MUD System Design
- GPU CUDA Architecture
- GPU-accelerated Implementation of MUD
- Simulation Results
- Summary

Introduction and Motivation
- Bidirectional satellite communication: the multi-user access issue is conventionally handled by MF-TDMA (e.g. DVB-RCS)
- Multiuser Detection (MUD) can increase spectrum efficiency
- Few practical MUD implementations exist for satellite systems:
  - High complexity
  - Sensitive to synchronization and channel estimation errors

Introduction and Motivation
- The NEXT project (Network Coding Satellite Experiment) paved the way for the GEO research communication satellite H2Sat
- H2Sat: explore and test new broadband (high data rate) satellite communication
- NEXT Experiment 3: Multiuser Detection (MUD) for satellite return links
  - Two users transmit at the same frequency and time
  - A transparent satellite return link
- Main objectives:
  - Develop a MUD receiver in SDR
  - Increase decoding throughput to reach real-time processing

MUD System Design
- Multiuser detection (MUD) complexity:
  - Optimal MUD proposed by Verdú: complexity exponential in the number of users
  - Suboptimal MUD algorithms: e.g. PIC, SIC
- We use Successive Interference Cancellation (SIC):
  - Complexity linear in the number of users
  - Straightforward extension to support more users
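In complexity terms (a standard result, not quantified on the slide): for K users with binary modulation, jointly optimal detection must search the full symbol hypothesis space per symbol interval, whereas SIC handles the users one at a time:

```latex
\underbrace{\mathcal{O}\!\left(2^{K}\right)}_{\text{optimal MUD, per symbol}}
\qquad \text{vs.} \qquad
\underbrace{\mathcal{O}(K)}_{\text{SIC}}
```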

MUD System Design
- Successive Interference Cancellation (SIC): sequentially decode the users and cancel their interference (a sketch of the loop is given below)
- Multi-stage SIC improves the PER, at the cost of possible error propagation
- Sensitive to channel estimation errors and phase noise, addressed with Expectation Maximization Channel Estimation (EM-CE)
- LDPC channel coding
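To make the SIC structure concrete, here is a minimal host-side sketch of a multi-stage SIC loop. It is an illustration under assumptions, not the DLR implementation; the helper functions em_estimate_and_decode and reconstruct are hypothetical placeholders.

```cpp
#include <complex>
#include <vector>

using Signal = std::vector<std::complex<float>>;

// Hypothetical per-user receiver stages; in the actual system these run as
// CUDA kernels on device memory (see the GPU sections below).
Signal em_estimate_and_decode(const Signal& in, int user);  // EM-CE + LDPC
Signal reconstruct(const Signal& decoded, int user);        // remod + channel

// Multi-stage SIC; users are assumed pre-sorted by received power.
void sic(const Signal& rx, int num_users, int num_stages) {
    std::vector<Signal> rebuilt(num_users, Signal(rx.size()));  // all zeros
    for (int stage = 0; stage < num_stages; ++stage) {
        for (int u = 0; u < num_users; ++u) {
            // Subtract the current reconstructions of all *other* users.
            Signal residual = rx;
            for (int v = 0; v < num_users; ++v) {
                if (v == u) continue;
                for (size_t n = 0; n < residual.size(); ++n)
                    residual[n] -= rebuilt[v][n];
            }
            // Decode user u on the cleaned burst, then rebuild its waveform
            // so the later stages can cancel it more accurately.
            rebuilt[u] = reconstruct(em_estimate_and_decode(residual, u), u);
        }
    }
}
```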

MUD System Design
[Figure-only slide; no transcript text]

GPGPU
- GPUs are massively multithreaded many-core chips, used for:
  - Image and video rendering
  - General-purpose computations
- Nvidia Tesla C2070: 448 cores, 515 GFLOPS double-precision peak performance
(Ref: Nvidia CUDA C Programming Guide, 2013)

GPGPU
- The GPU is specialized for computation-intensive, highly parallel workloads (exactly what graphics rendering is about)
- More transistors are devoted to data processing (ALUs: Arithmetic Logic Units) rather than to data caching and flow control
- CPUs run a limited number of concurrent threads: a server with four hex-core processors supports 24 concurrently active threads (or 48 with HyperThreading)
- GPUs run many more concurrent threads: hundreds of cores and thousands of concurrently active threads

CUDA Architecture
- Nov. 2006: first GPU built on Nvidia's CUDA architecture
- CUDA: Compute Unified Device Architecture
- Every ALU can be used for general-purpose computations
- All execution units can arbitrarily read and write memory
- Programmable in high-level languages (C/C++, OpenCL, Fortran, Java and Python bindings)

CUDA Architecture
- A CUDA program is a serial program with parallel kernels:
  - Serial code executes in a host (CPU) thread
  - Parallel kernel code executes in many device (GPU) threads
- Host (CPU) and device (GPU) maintain separate memory spaces (a minimal example follows below)
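To illustrate the host/device split, here is a minimal, self-contained CUDA vector addition (my example, not from the slides): the serial host thread allocates device memory, copies data across, launches a parallel kernel, and copies the result back.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Parallel kernel: one GPU thread per vector element.
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Separate memory spaces: host buffers...
    float* ha = (float*)malloc(bytes);
    float* hb = (float*)malloc(bytes);
    float* hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // ...and device buffers, with explicit copies between them.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // The serial host thread launches the kernel on n GPU threads.
    add<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %.1f\n", hc[0]);  // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```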

LDPC Decoder on GPU
- Assign one CUDA thread to each edge of each check node (a kernel sketch is given below)
- U1 code: n = 4800, k = 3200; U2 code: n = 4800, k = 2400
[Figure: Tanner graph with check nodes C1 ... C(n-k) connected to variable nodes V1 ... Vn]
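As an illustration of the one-thread-per-edge mapping, here is a hedged sketch of a min-sum check-node update kernel. The flat edge layout, the min-sum variant, and all names are my assumptions, not the presentation's actual decoder.

```cuda
// Min-sum check-node update with one CUDA thread per Tanner-graph edge.
// Edges are stored flat: edge e belongs to check node check_of[e], and the
// edges of check node c occupy the range [row_start[c], row_start[c+1]).
__global__ void check_node_update(const float* v2c,  // variable-to-check msgs
                                  float* c2v,        // check-to-variable msgs
                                  const int* check_of,
                                  const int* row_start,
                                  int num_edges) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_edges) return;

    int c = check_of[e];
    float sign = 1.0f;
    float min_abs = 1e30f;

    // Min-sum rule: combine the messages on all *other* edges of this
    // check node (product of signs, minimum of magnitudes).
    for (int j = row_start[c]; j < row_start[c + 1]; ++j) {
        if (j == e) continue;
        float m = v2c[j];
        if (m < 0.0f) sign = -sign;
        float a = fabsf(m);
        if (a < min_abs) min_abs = a;
    }
    c2v[e] = sign * min_abs;
}
```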

MUD Receiver on GPU
- Processing bottlenecks:
  - LDPC channel decoding
  - EM channel estimation
  - Resampling and interference cancellation
  - Data transfer between host and device memory (144 GB/s device memory bandwidth on the Nvidia Tesla vs. 8 GB/s over PCIe x16)
- Therefore, all parts of each single-user receiver and the interference cancellation run on the GPU
- Minimize the latency of intermediate data transfers between host and device memory (see the sketch below)
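A minimal sketch of this design point (a fragment; the kernel and buffer names are hypothetical, not the DLR code): all intermediate buffers stay resident in device memory, kernels are chained back to back, and the PCIe bus is crossed only once per direction per burst.

```cuda
// Device-resident SIC receiver chain (fragment; kernels declared elsewhere).
// One host-to-device copy in, one device-to-host copy out per burst.
cudaMemcpy(d_rx, h_rx, rx_bytes, cudaMemcpyHostToDevice);

for (int stage = 0; stage < num_stages; ++stage) {
    for (int u = 0; u < num_users; ++u) {
        // Every intermediate result (d_residual, d_chan, d_bits, d_rebuilt)
        // lives in GPU memory; no host round-trips between stages.
        cancel_interference<<<blocks, threads>>>(d_rx, d_rebuilt, d_residual, u);
        em_channel_estimate<<<blocks, threads>>>(d_residual, d_chan, u);
        ldpc_decode<<<blocks, threads>>>(d_residual, d_chan, d_bits, u);
        reconstruct<<<blocks, threads>>>(d_bits, d_chan, d_rebuilt, u);
    }
}

cudaMemcpy(h_bits, d_bits, bits_bytes, cudaMemcpyDeviceToHost);
```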

Simulation Setup
- GPU: Nvidia Tesla C2070 (1.15 GHz)
- Comparison benchmark: Intel Xeon CPU E5620 (2.4 GHz)
- BPSK modulation
- Two user terminals (power imbalance: U1 3 dB higher than U2)
- Channel coding: LDPC (Irregular Repeat Accumulate), blocklength 4800 bits
- U1 code rate: 2/3; U2 code rate: 1/2
- Baud rate: [value missing from transcript] symbols/second → real-time threshold: ca. 85 ms (66 kbps)
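As a consistency check (my arithmetic, not from the slides): each 4800-bit block carries 4800 * 2/3 = 3200 information bits for U1 and 4800 * 1/2 = 2400 for U2, so decoding both users within the ca. 85 ms budget corresponds to

```latex
\frac{(3200 + 2400)\,\text{bits}}{85\,\text{ms}} \approx 66\ \text{kbps},
```

which matches the stated threshold.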

Simulation Results
[Figure: total processing time compared against the real-time threshold]

Summary
- SDR implementation of a MUD receiver:
  - High flexibility and low cost
  - Extensible to support more users
- GPU acceleration:
  - 1.8x to 3.8x faster than the real-time threshold
  - Still room for improvement; a newer GPU gives better performance
- GPU CUDA is very promising for powerful parallel computing:
  - Low learning curve
  - Heterogeneous: mixed serial-parallel programming
  - Scalable
  - CUDA-powered MATLAB (MATLAB with the Parallel Computing Toolbox; Jacket from AccelerEyes): days/weeks of simulation reduced to hours

GNURadio
- "GNU Radio is a free & open-source software development toolkit that provides signal processing blocks to implement software radios"
- Software architecture: a Python script or GNU Radio Companion flowgraph drives a Python module, which wraps a C++ shared library via SWIG
- The main processing of the blocks is implemented in C++ functions executed by the CPU of the PC

GNURadio + CUDA
- Irregular Repeat Accumulate LDPC (IRA): n = 4800, k = 2400
[Figure; remaining slide content not recoverable from the transcript]
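The slides show no code for this integration, but a GNU Radio block that offloads decoding to CUDA could look like the following sketch. The block and kernel names are hypothetical; only the GNU Radio 3.x sync_block API and the IRA blocklength (n = 4800) are taken as given.

```cuda
#include <gnuradio/sync_block.h>
#include <gnuradio/io_signature.h>
#include <cuda_runtime.h>

// Hypothetical launcher for a CUDA LDPC decoder kernel (defined elsewhere,
// e.g. built around the check-node kernel sketched above).
void launch_ldpc_decode(const float* d_llr, unsigned char* d_bits, int n);

class ldpc_cuda_block : public gr::sync_block {
    static const int N = 4800;        // IRA blocklength from the slide
    float* d_llr = nullptr;           // device-side input LLRs
    unsigned char* d_bits = nullptr;  // device-side decoded bits
public:
    ldpc_cuda_block()
        : gr::sync_block("ldpc_cuda",
                         gr::io_signature::make(1, 1, sizeof(float)),
                         gr::io_signature::make(1, 1, sizeof(unsigned char))) {
        set_output_multiple(N);  // always process whole code blocks
        cudaMalloc(&d_llr, N * sizeof(float));
        cudaMalloc(&d_bits, N);
    }
    ~ldpc_cuda_block() override { cudaFree(d_llr); cudaFree(d_bits); }

    int work(int noutput_items,
             gr_vector_const_void_star& input_items,
             gr_vector_void_star& output_items) override {
        (void)noutput_items;  // >= N is guaranteed by set_output_multiple
        // Move one code block to the GPU, decode there, fetch the result.
        cudaMemcpy(d_llr, input_items[0], N * sizeof(float),
                   cudaMemcpyHostToDevice);
        launch_ldpc_decode(d_llr, d_bits, N);
        cudaMemcpy(output_items[0], d_bits, N, cudaMemcpyDeviceToHost);
        return N;  // produced (and consumed) one full code block
    }
};
```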

Thank you! Q&A?
[Closing slide figures labelled "CUDA core", "CPU", "CPU monster", "CUDA monster"]

GPGPU
- Advantages of the GPU:
  - High computational processing power
  - High memory bandwidth
  - High flexibility
- Drawbacks of the GPU:
  - Not a stand-alone device
  - Bad at serial processing
  - Separate memory space
  - Additional hands-on effort required

[Figure: comparison of total MUD processing time between CPU and GPU]