
1 Harvesting the Opportunity of GPU-based Acceleration. Matei Ripeanu, Networked Systems Laboratory (NetSysLab), University of British Columbia. Joint work with Abdullah Gharaibeh and Samer Al-Kiswany.

2 A golf course … a (nudist) beach … (and 199 days of rain each year). Networked Systems Laboratory (NetSysLab), University of British Columbia.

3 Hybrid architectures in the Top500 [Nov '10]

4 Hybrid architectures
- High compute power / memory bandwidth
- Energy efficient [though operated today at low efficiency]
Agenda for this talk
- GPU architecture intuition: what generates the above characteristics?
- Progress on efficiently harnessing hybrid (GPU-based) architectures

5 Acknowledgement: slide borrowed from a presentation by Kayvon Fatahalian

6-9 [Image-only slides; no transcript text]

10 Acknowledgement: slide borrowed from a presentation by Kayvon Fatahalian

11 Acknowledgement: slide borrowed from a presentation by Kayvon Fatahalian

12 Idea #3: Feed the cores with data. The processing elements are data hungry! → wide, high-throughput memory bus
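
To make "feed the cores" concrete, here is a minimal CUDA sketch (not from the talk; the kernel name and the scaling operation are arbitrary) in which consecutive threads access consecutive elements, so the hardware can coalesce the accesses into the wide, high-throughput transactions the memory bus is built for:

    // Illustrative kernel: thread i touches element i, so the loads and stores
    // of a warp coalesce into a few wide memory transactions.
    __global__ void scale(float* data, float alpha, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) data[i] = alpha * data[i];            // coalesced read and write
    }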

13 Idea #4: Hide memory access latency. 10,000x parallelism! → hardware-supported multithreading
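
In practice this latency hiding is exposed to the programmer simply by launching far more threads than there are cores, so the hardware scheduler always has warps ready to run while others wait on memory. A hedged host-side sketch (the element count and launch configuration are assumptions; it reuses the scale kernel sketched above):

    // Illustrative host code: oversubscribe the GPU so stalled warps can be
    // switched out while their memory requests are in flight.
    #include <cuda_runtime.h>
    int main() {
        int n = 1 << 24;                                           // ~16M elements, vastly more threads than cores
        float* d_data;
        cudaMalloc((void**)&d_data, n * sizeof(float));            // global-memory buffer
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ~65K blocks
        scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);       // kernel from the previous sketch
        cudaDeviceSynchronize();                                   // wait for the kernel to finish
        cudaFree(d_data);
        return 0;
    }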

14 The Resulting GPU Architecture
[Diagram: N multiprocessors, each with M cores, per-core registers, shared memory, and an instruction unit; device-wide global, texture, and constant memories; connected over PCIe x16 (~4 GB/s) to the host machine and host memory]
NVIDIA Tesla C2050:
- 448 cores
- Four 'memories':
  - Shared: fast (~4 cycles), small (48 KB)
  - Global: slow (400-600 cycles), large (up to 3 GB), high throughput (~150 GB/s)
  - Texture: read-only
  - Constant: read-only
- Hybrid: PCIe x16 (~4 GB/s) link to host memory
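
To make the distinction between these memories concrete, below is a small, hypothetical kernel (not from the alignment work; names and the reduction it performs are illustrative) that stages a tile of data from the slow global memory into the fast on-chip shared memory once, then reuses it from there:

    // Illustrative shared-memory pattern: one coalesced global load per thread,
    // then the whole block reuses the tile from fast on-chip storage.
    #define TILE 256
    __global__ void sum_tiles(const float* in, float* out, int n) {
        __shared__ float tile[TILE];                    // on-chip: ~4-cycle latency, 48 KB per multiprocessor
        int i = blockIdx.x * TILE + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // read once from global memory (400-600 cycles)
        __syncthreads();                                // make the full tile visible to the block
        if (threadIdx.x == 0) {                         // one thread reduces the tile
            float s = 0.0f;
            for (int k = 0; k < TILE; ++k) s += tile[k];
            out[blockIdx.x] = s;                        // one partial sum per block, back in global memory
        }
    }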

15 GPUs offer different characteristics
- High peak compute power
- High peak memory bandwidth
- High host-device communication overhead
- Limited memory space
- Complex to program

16 Projects at NetSysLab@UBC (http://netsyslab.ece.ubc.ca)
Porting applications to efficiently exploit GPU characteristics:
- Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
- Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011
Middleware runtime support to simplify application development:
- CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, TR
GPU-optimized building blocks: data structures and libraries:
- GPU Support for Batch Oriented Workloads, L. Costa, S. Al-Kiswany, M. Ripeanu, IPCCC'09
- Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
- A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10
- On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08

17 Motivating question: How should we design applications to efficiently exploit GPU characteristics?
Context: a bioinformatics problem, sequence alignment
- A string-matching problem
- Data intensive (~10^2 GB)
[Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10]

18 Past work: sequence alignment on GPUs
MUMmerGPU [Schatz 07, Trapnell 09]:
- A GPU port of the sequence alignment tool MUMmer [Kurtz 04]
- ~4x speedup (end-to-end) compared to the CPU version
Hypothesis: a mismatch between the core data structure (the suffix tree) and GPU characteristics
[Chart: breakdown of execution time (%); more than 50% is overhead]

19 Idea: trade off time for space. Use a space-efficient data structure (even though it belongs to a higher computational complexity class): the suffix array.
Result: 4x speedup over the suffix-tree-based GPU version and a significant reduction in overheads.
Consequences:
- Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
- Focus shifts towards optimizing the compute stage

20 Outline for the rest of this talk
- Sequence alignment: background and offloading to the GPU
- Space/time trade-off analysis
- Evaluation

21 Background: the Sequence Alignment Problem
[Figure: many short queries aligned against positions in a long reference sequence]
Problem: find where each query most likely originated from.
- Queries: ~10^8 queries, 10^1 to 10^2 symbols per query
- Reference: 10^6 to 10^11 symbols (up to ~400 GB)

22 GPU Offloading: Opportunity and Challenges
Opportunity:
- Sequence alignment: easy to partition, memory intensive
- GPU: massively parallel, high memory bandwidth
Challenges:
- Data intensive
- Large output size
- Limited memory space
- No direct access to other I/O devices (e.g., disk)

23 GPU Offloading: Addressing the Challenges
- Data-intensive problem and limited memory space → divide and compute in rounds; search-optimized data structures
- Large output size → compressed output representation (decompressed on the CPU)
High-level algorithm (executed on the host):
    subrefs = DivideRef(ref)
    subqrysets = DivideQrys(qrys)
    foreach subqryset in subqrysets {
        results = NULL
        CopyToGPU(subqryset)
        foreach subref in subrefs {
            CopyToGPU(subref)
            MatchKernel(subqryset, subref)
            CopyFromGPU(results)
        }
        Decompress(results)
    }
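
As a rough CUDA rendering of the host loop above (a sketch under assumptions: the Chunk type, match_kernel stub, helper names, and launch configuration are illustrative, not the authors' implementation):

    // Hedged sketch of the round-based offloading loop as CUDA host code.
    #include <cuda_runtime.h>
    #include <vector>

    struct Chunk { const char* data; size_t len; };      // one sub-reference or sub-query set

    // Placeholder kernel: a real implementation matches each query against the sub-reference.
    __global__ void match_kernel(const char* ref, size_t ref_len,
                                 const char* qrys, size_t qrys_len, int* results) {
        if (blockIdx.x == 0 && threadIdx.x == 0) results[0] = 0;   // stub: compressed matches go here
    }

    static void decompress_on_cpu(std::vector<int>& results) { /* CPU-side post-processing stub */ }

    void align_in_rounds(const std::vector<Chunk>& subrefs,
                         const std::vector<Chunk>& subqrysets,
                         size_t max_ref, size_t max_qry, size_t max_results) {
        char *d_ref, *d_qrys; int *d_results;
        cudaMalloc((void**)&d_ref, max_ref);              // device buffers sized for the largest chunks
        cudaMalloc((void**)&d_qrys, max_qry);
        cudaMalloc((void**)&d_results, max_results * sizeof(int));
        std::vector<int> h_results(max_results);

        for (const Chunk& q : subqrysets) {               // outer loop over query chunks
            cudaMemcpy(d_qrys, q.data, q.len, cudaMemcpyHostToDevice);
            for (const Chunk& r : subrefs) {              // inner loop over reference chunks
                cudaMemcpy(d_ref, r.data, r.len, cudaMemcpyHostToDevice);
                match_kernel<<<4096, 256>>>(d_ref, r.len, d_qrys, q.len, d_results);
                cudaMemcpy(h_results.data(), d_results,
                           max_results * sizeof(int), cudaMemcpyDeviceToHost);   // also syncs the kernel
            }
            decompress_on_cpu(h_results);                 // expand the compressed output on the host
        }
        cudaFree(d_ref); cudaFree(d_qrys); cudaFree(d_results);
    }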

24 Space/Time Trade-off Analysis

25 The core data structure. A massive number of queries and a long reference => pre-process the reference into an index.
Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
- Search: O(qry_len) per query
- Space: O(ref_len), but the constant is high, ~20 x ref_len
- Post-processing: O(4^(qry_len - min_match_len)), a DFS traversal per query

26 [The same suffix-tree summary as the previous slide, shown alongside the high-level host algorithm, with the algorithm's steps annotated as expensive or efficient]

27 A better matching data structure?
[Figure: suffix tree vs. suffix array for the example reference TACACA$; sorted suffixes: 0 A$, 1 ACA$, 2 ACACA$, 3 CA$, 4 CACA$, 5 TACACA$]
               | Suffix Tree                     | Suffix Array
Space          | O(ref_len), ~20 x ref_len       | O(ref_len), ~4 x ref_len
Search         | O(qry_len)                      | O(qry_len x log ref_len)
Post-process   | O(4^(qry_len - min_match_len))  | O(qry_len - min_match_len)
Impact 1: reduced communication (less data to transfer)
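
For intuition about these complexities: the suffix array is simply the sorted list of suffix start positions, and search is a binary search over it. The sketch below (plain C++ host code with a deliberately naive construction; an illustration of the data structure, not the paper's GPU implementation) builds the array for the example reference and looks up one query:

    // Textbook suffix-array sketch (illustrative only).
    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Naive construction: sort suffix start positions lexicographically.
    // Real tools use much faster builders; the point here is the ~4 bytes/symbol footprint.
    std::vector<int> build_suffix_array(const std::string& ref) {
        std::vector<int> sa(ref.size());
        for (size_t i = 0; i < ref.size(); ++i) sa[i] = (int)i;
        std::sort(sa.begin(), sa.end(), [&](int a, int b) {
            return ref.compare(a, std::string::npos, ref, b, std::string::npos) < 0;
        });
        return sa;
    }

    // Binary search over the sorted suffixes: O(qry_len x log ref_len) per query.
    int find_one_match(const std::string& ref, const std::vector<int>& sa,
                       const std::string& qry) {
        int lo = 0, hi = (int)sa.size();
        while (lo < hi) {                                // find the first suffix >= qry
            int mid = (lo + hi) / 2;
            if (ref.compare(sa[mid], qry.size(), qry) < 0) lo = mid + 1; else hi = mid;
        }
        if (lo < (int)sa.size() && ref.compare(sa[lo], qry.size(), qry) == 0)
            return sa[lo];                               // start position of one occurrence
        return -1;                                       // query does not occur in the reference
    }

    int main() {
        std::string ref = "TACACA$";                     // the slide's example reference
        std::vector<int> sa = build_suffix_array(ref);
        std::printf("ACA matches at position %d\n", find_one_match(ref, sa, "ACA"));
        return 0;
    }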

28 A better matching data structure
[Same suffix tree vs. suffix array comparison as the previous slide]
Impact 2: better data locality, achieved at the cost of additional per-thread processing time. Space for longer sub-references => fewer processing rounds
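
A back-of-the-envelope illustration of the "fewer rounds" effect (the constants are assumptions borrowed from figures elsewhere in the talk, and the calculation ignores the space needed for queries and results):

    // Rough arithmetic only: how many passes over the reference are needed per
    // query chunk if the index must fit in GPU memory.
    #include <cstdio>
    int main() {
        double ref_len             = 238e6;   // symbols, roughly the HS1 reference
        double gpu_mem_bytes       = 512e6;   // the low-end 512 MB card
        double tree_bytes_per_sym  = 20.0;    // ~20 x ref_len for the suffix tree
        double array_bytes_per_sym = 4.0;     // ~4 x ref_len for the suffix array
        std::printf("passes per query chunk: tree ~%.1f, array ~%.1f\n",
                    ref_len * tree_bytes_per_sym  / gpu_mem_bytes,    // ~9.3
                    ref_len * array_bytes_per_sym / gpu_mem_bytes);   // ~1.9
        return 0;
    }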

29 A better matching data structure
[Same suffix tree vs. suffix array comparison as the previous slide]
Impact 3: lower post-processing overhead

30 Evaluation

31 Evaluation setup
Testbed:
- Low-end: GeForce 9800 GX2 GPU (512 MB)
- High-end: Tesla C1060 (4 GB)
Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])
Success metrics: performance, energy consumption
Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces):
Workload / Species           | Reference length | # of queries | Avg. read length
HS1 - Human (chromosome 2)   | ~238M            | ~78M         | ~200
HS2 - Human (chromosome 3)   | ~100M            | ~2M          | ~700
MONO - L. monocytogenes      | ~3M              | ~6M          | ~120
SUIS - S. suis               | ~2M              | ~26M         | ~36

32 Speedup: array-based over tree-based [chart]

33 Dissecting the overheads [chart]
Significant reduction in data transfers and post-processing.
Workload: HS1 (~78M queries, ~238M reference length) on the GeForce 9800 GX2

34 Comparing with CPU performance
[Chart; baseline: single-core CPU performance; series: suffix tree, suffix array]

35 Summary
- GPUs have drastically different performance characteristics
- Reconsidering the choice of data structure is necessary when porting applications to the GPU
- A good matching data structure ensures:
  - Low communication overhead
  - Data locality (possibly achieved at the cost of additional per-thread processing time)
  - Low post-processing overhead

36 Code, benchmarks, and papers available at: netsyslab.ece.ubc.ca

