Presentation is loading. Please wait.

Presentation is loading. Please wait.

Efficient Implementation of a String Matching Algorithm for SRC and Cray Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1,

Similar presentations


Presentation on theme: "Efficient Implementation of a String Matching Algorithm for SRC and Cray Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1,"— Presentation transcript:

1 Efficient Implementation of a String Matching Algorithm for SRC and Cray Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, Mohamed Abouellail 1, Nandakishore Sastry 2, and Kris Gaj 2 1 The George Washington University, 2 George Mason University Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1, Mohamed Abouellail 1, Nandakishore Sastry 2, and Kris Gaj 2 1 The George Washington University, 2 George Mason University

2 21017 / MAPLD2005El-Araby Outline  Introduction  SRC Hardware & Software  Cray XD1 Hardware & Software  String Matching Algorithms  Implementation Methodology  Results and Comparisons  Conclusions

3 31017 / MAPLD2005El-Araby Introduction Interface  P memory  P memory... PP PP I/O Interface FPGA memory FPGA memory... FPGA... I/O Microprocessor SystemReconfigurable Processor System

4 41017 / MAPLD2005El-Araby Outline  Introduction  SRC Hardware & Software  Cray XD1 Hardware & Software  String Matching Algorithms  Implementation Methodology  Results and Comparisons  Conclusions

5 51017 / MAPLD2005El-Araby  Hi-Bar sustains 1.4 GB/s per port with 180 ns latency per tier  Up to 256 input and 256 output ports with two tiers of switch  Common Memory (CM) has controller with DMA capability  Controller can perform other functions such as scatter/gather  Up to 8 GB DDR SDRAM supported per CM node SRC Architecture (Hi-Bar TM Based Systems) Storage Area Network Local Area Network Wide Area Network Disk Customers’ Existing Networks PCI-X PCI-X MAP ® SRC-6 MAP PPPP Memory SNAP™ PPPP Memory SNAP Gig Ethernet etc. Common Memory ChainingGPIO SRC Hi-Bar Switch

6 61017 / MAPLD2005El-Araby SRC Reconfigurable Processor

7 71017 / MAPLD2005El-Araby SRC Programming Environment  P system FPGA system HLL (C) HDL (VHDL)

8 81017 / MAPLD2005El-Araby SRC Programming Environment (cnt’d)

9 91017 / MAPLD2005El-Araby Main program Function_1(a, d, e) Function_2(d, e, f) Function_1 Function_2 Macro_1(a, b, c) Macro_2(b, d) Macro_2(c, e) Macro_3(s, t) Macro_1(n, b) Macro_4(t, k) FPGA …… Macro_1 Macro_2 a b c de FPGA contents after the Function_1 call Program in C or Fortran SRC Programming Environment (cnt’d)

10 101017 / MAPLD2005El-Araby Outline  Introduction  SRC Hardware & Software  Cray XD1 Hardware & Software  String Matching Algorithms  Implementation Methodology  Results and Comparisons  Conclusions

11 111017 / MAPLD2005El-Araby Cray XD1 System Architecture (One Chassis) RapidArray components in a Cray XD1 chassis FPGA and 2 nd RAP are on Expansion Module Compute  12 AMD Opteron 32/64 bit, x86 processors  High Performance Linux RapidArray Interconnect  12 communications processors  1 Tb/s switch fabric Active Management  Dedicated processor Application Acceleration  6 co-processors

12 121017 / MAPLD2005El-Araby Cray XD1 Application Acceleration Interfaces  XC2VP30-50 running at up to 200 MHz  4 QDR II RAM with over 400 HSTL-I I/O at 200 MHz DDR (400 MTransfers/s)  16 bit simplified HyperTransport I/F at 400 MHz DDR (800 MTransfers/s)  QDR and HT I/F take up <20 % of XC2VP30. The rest is available for user applications User Logic ADDR(20:0) D(35:0) Q(35:0) TX RX RapidArray Transport ADDR(20:0) D(35:0) Q(35:0) ADDR(20:0) D(35:0) Q(35:0) ADDR(20:0) D(35:0) Q(35:0) RapidArray Transport Core QDR RAM Interface Core QDR II SRAM RAP Virtex-II Pro

13 131017 / MAPLD2005El-Araby Cray XD1 Development Flow Hardware FlowSoftware Flow Standard Hardware Flow

14 141017 / MAPLD2005El-Araby Cray XD1 Hardware Development Flow Standard Flow Additional High-Level Tools

15 151017 / MAPLD2005El-Araby Design Methodology using Cray XD1  Write application in C for system microprocessor  Identify computation intense routine(s)  Generate a bitstream using Cray Cores (RT & QDRII) and language of choice  Create module in HDL (Verilog, VHDL)  Create module using High Level Language Tools  Validate Module  Synthesize using (XST, Leonardo, Synplify Pro)  Create bitstream using Xilinx place & route tools  Replace routines with Cray API calls  Run Application

16 161017 / MAPLD2005El-Araby Outline  Introduction  SRC Hardware & Software  Cray XD1 Hardware & Software  String Matching Algorithms  Implementation Methodology  Results and Comparisons  Conclusions

17 171017 / MAPLD2005El-Araby String Matching - Introduction  String Matching – detecting the occurrence of a particular substring, called the pattern, in another string, called the text  Types of String matching:  Exact string matching  Approximate string matching  Exact string matching:  Involves match patterns, where they exist completely, that is unbroken and with no irrelevant data in between any letters  Numerous Applications : NIDS, text editing, …etc.  Approximate string matching:  Pattern rarely matches the text completely  Finds application in Computational biology (DNA matching), image detection, handwriting recognition…etc.

18 181017 / MAPLD2005El-Araby  Why align two protein or DNA sequences?  Determine whether they are descended from a common ancestor (homologous)  Infer a common function  Locate functional elements  Infer protein structure, if the structure of one of the sequences is known  Problem:  find the best pairwise alignment of GAATC and CATAC DNA Matching Basics GAATC CATAC GAATC- CA-TAC GAAT-C C-ATAC GAAT-C CA-TAC -GAAT-C C-A-TAC GA-ATC CATA-C  We need a way to measure the quality of a candidate alignment  Alignment scores consist of two parts:  substitution matrix  gap penalty

19 191017 / MAPLD2005El-Araby PurineAG PyrimidineCT Transition (cheap) Transversion (expensive) 10-5 0 T 10-5 0G 0 10-5C 0 10A TGCA A hypothetical substitution matrix GAAT-C CA-TAC -5 + 10 + ? + 10 + ? + 10 = ? GAAT-C d=-4 CA-TAC -5 + 10 + -4 + 10 + -4 + 10 = 17 G--AATC d=-4 CATA--C e=-1 -5 + -4 + -1 + 10 + -4 + -1 + 10 = 5 DNA Matching Basics (cnt’d) Scoring aligned bases Scoring gaps  Linear gap penalty: every gap receives a score of d  Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e

20 201017 / MAPLD2005El-Araby Read sequences A & B Into two arrays Set traceback & Similarity matrix to (A+1) * (B+1) 1’s row & column of Similarity Matrix = 0 Initialize traceback Arrays by setting to -1 (default value) Compute Similarity Matrix [i] [j] Update traceback Array Traceback for best alignments NOTE: Traceback array carries the coordinates of one of three cells involved in the calculation of the cell [i] [j] in the similarity matrix no A A yes Similarity Matrix Complete? Approximate String Matching Algorithm (Smith-Waterman Algorithm)

21 211017 / MAPLD2005El-Araby Outline  Introduction  SRC Hardware & Software  Cray XD1 Hardware & Software  String Matching Algorithms  Implementation Methodology  Results and Comparisons  Conclusions

22 221017 / MAPLD2005El-Araby Software Only Implementation Software/Hardware Implementation Hardware Only Implementation C function for  P C function for MAP VHDL Macro  P System FPGA System Implementation Schemes in SRC

23 231017 / MAPLD2005El-Araby Operational Environment Operational Scenarios for Cray XD1 µP-Initiated Transfers FPGA-Initiated Transfers Write-Only Transfers

24 241017 / MAPLD2005El-Araby Outline  Introduction  SRC Hardware & Software  Cray XD1 Hardware & Software  String Matching Algorithms  Implementation Methodology  Results and Comparisons  Conclusions

25 251017 / MAPLD2005El-Araby Performance Results  Rate = (FPGA freq.) X (cycles/cell) X (# SWPEs)  Opteron Implementation (SSEARCH34) *  100 Million Cell Updates Per Second (CUPS)  Cray Inc. Implementation *  Current unoptimized design  80 MHz X 1 X 32 = 2.56 Billion CUPS (GCUPS)  With optimization  100 MHZ x 1 x 50 = 5.0 GCUPS  With future Virtex 4 FPGA  100 MHZ x 1 x 150 = 15 GCUPS  25x speedup vs. Opteron  Our Implementation  SRC-6  Current unoptimized design » 100 MHz X 1 X (16x16) = 25.6 GCUPS  10x speedup vs. Cray  256x speedup vs. Opteron  Cray XD1  Current unoptimized design » 200 MHz X 1 X (16x16) = 51.2 GCUPS  20x speedup vs. Cray  512x speedup vs. Opteron * CUG’05, New Mexico, May 2005

26 261017 / MAPLD2005El-Araby Conclusions  Smith-Waterman sequence alignment algorithm has been implemented on both SRC-6 and Cray XD1 systems  Similarities and differences are highlighted with regard to:  System hardware architecture  Ease of programming  Programming model  Development time  Hardware/software libraries  Performance  The speed-up vs. microprocessor is reported  Primary bottlenecks limiting the performance of both systems are recognized  The capability to share and port applications between the SRC and Cray systems is explored


Download ppt "Efficient Implementation of a String Matching Algorithm for SRC and Cray Reconfigurable Computers Esam El-Araby 1, Mohamed Taher 1, Tarek El-Ghazawi 1,"

Similar presentations


Ads by Google