04/23/2003 Massively Parallel Solutions for Molecular Sequence Analysis Prabhakar R. Gudla CMSC 838T Presentation.

1 04/23/2003 Massively Parallel Solutions for Molecular Sequence Analysis Prabhakar R. Gudla CMSC 838T Presentation

2 Outline
- Motivation
- Smith-Waterman Algorithm
  - Parallelization
- High Performance Computing
  - Hybrid Architecture
  - Fuzion 150
- Performance Evaluation
- Conclusions and Comments

3 Motivation
- Newly discovered sequences are analyzed by comparing them with sequence databases.
- Complexity is proportional to the product of query size and database size.
- Analysis is therefore too slow on sequential computers.

4 Sequence Alignment
- Two possible approaches
  - Heuristics (e.g. BLAST, FASTA): the more efficient the heuristic, the worse the quality of the results
  - Parallel processing: high-quality results in reasonable time
- BLAST, FASTA, Smith-Waterman (S-W)
[Figure: BLAST, FASTA, and Smith-Waterman arranged along two axes, search speed (slower to faster) and data quality (lower to higher).]

6 Parallelization of S-W
- Matrix cells along a single diagonal are computed in parallel.
- The comparison is performed in l1 + l2 − 1 steps on l1 PEs.
[Figure: DP matrix for the sequences ATCTCG (length l1, one character per PE, P1 to P6) and GTCTATC (length l2); the second sequence is shifted through the PEs one character per step, so one anti-diagonal of the matrix is filled in each step.]
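The diagonal-by-diagonal schedule can be sketched in Python. This is only an illustration of the step count, not the accelerator code: the scoring values (match 2, mismatch −1, linear gap penalty 1) and the function name `sw_wavefront` are assumptions, and the inner loop stands in for work that the PEs would do simultaneously.

```python
def sw_wavefront(query, subject, match=2, mismatch=-1, gap=1):
    """Score-only Smith-Waterman, evaluated one anti-diagonal per step.

    Scoring parameters are illustrative assumptions, not taken from the slides.
    Returns (best local score, number of wavefront steps).
    """
    l1, l2 = len(query), len(subject)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    steps = 0
    # Cells with i + j == d form one anti-diagonal; d runs from 2 to l1 + l2.
    for d in range(2, l1 + l2 + 1):
        lo = max(1, d - l2)
        hi = min(l1, d - 1)
        # Every cell on this diagonal depends only on diagonals d-1 and d-2,
        # so PE i (holding query[i-1]) could compute its cell independently.
        for i in range(lo, hi + 1):
            j = d - i
            s = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] - gap, H[i][j - 1] - gap)
            best = max(best, H[i][j])
        steps += 1
    return best, steps


score, steps = sw_wavefront("ATCTCG", "GTCTATC")
```

For the two sequences on this slide the loop runs 6 + 7 − 1 = 12 times, matching the l1 + l2 − 1 step count.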

7 Parallel Architectures
- Embedded massively parallel accelerators
  - Fuzion 150: 1536 processors on a single chip
  - Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2, BioScan
  - Systola 1024: PC add-on board with 1024 processors

9 Previous Applications
- Volume Visualization [Schmidt '00]
- Automatic Visual Quality Control (Automobile Industry)
- Computer Tomography [Schmidt, Schimmler, and Schröder '98]
- Video Compression [Schmidt and Schimmler '99]
- Range of Transforms (Fourier, Wavelet, Hough, Radon) [Schmidt, Schimmler, and Schröder '99]
- Image Processing [Schimmler and Lang '96; Lenders and Schröder '90; Jiang, Edirisinghe, and Schröder '97]

10 Hybrid Architecture
- The hybrid computer combines the SIMD and MIMD paradigms within one parallel architecture.
[Figure labels: high-speed Myrinet switch; Systola 1024; hybrid computer.]

11 Architecture of Systola 1024
- Instruction Systolic Array (ISA):
  - 32 × 32 mesh of processing elements
  - wavefront instruction execution

12 Mapping onto Systola 1024
[Figure: the query sequence a (length 1024) is stored one character per PE, a0 through a1023; subject sequences b (b0, b1, …, bk) and c (c0, c1, …) are streamed through the array, separated by a marker X.]
- Subject sequences can be pipelined with only one step of delay
  - k steps for a subject sequence of length k
- Efficient routing on the ISA: row ringshift and broadcast

13 Fuzion 150 Architecture
- 0.25-µm, single-chip SIMD architecture
- 1536 PEs @ 200 MHz ≈ 300 GOPS
- 600 GB/s on-chip, 6.4 GB/s off-chip bandwidth
- Multithreading (control units interact via semaphores)
- Developed by Clearspeed Technology (UK) for graphics and network processing
[Figure: block diagram with a linear SIMD array of 1536 PEs (2 Kbytes DRAM each), FUZION bus, 32-bit EPU (ARC), video I/O and display, instruction fetch, SIMD controller, local memory on 1, 2, or 4 Rambus channels (6.4 GB/s), and host AGP interface.]

14 Fuzion 150 Architecture
[Figure: six blocks (Block 0 to Block 5) of 256 PEs each, PE (0,0) through PE (5,255), attached to the Fuzion bus and local memory. Each PE contains an 8-bit ALU, a 32-byte register file, and 2 KByte of DRAM PE memory, with links to its left and right neighbour PEs, an instruction input, and a block I/O channel.]

15 Mapping onto the Fuzion 150
[Figure: the query sequence a (length 1536) is stored one character per PE, a0 through a1535, across Block 0 to Block 5; subject sequences b (b0, b1, …, bk) and c (c0, c1, …) are streamed through the PEs, separated by a marker X.]
- No fast global communication: 2-step local communication is used instead
- Subject sequences can be pipelined with only one step of delay

17 Performance Evaluation
- Scan times in seconds for TrEMBL 14 (351,834 protein sequences) for various query sequence lengths
  - Parallel implementation scales linearly with sequence length
  - Computing time dominates data transfer time

Query sequence length               256    512   1024   2048   4096
Fuzion 150, time (s)                 12     22     42     82    162
  speedup vs. PIII 1 GHz             88     97    102    105    106
Systola 1024, time (s)              294    577   1137   2241   4611
  speedup vs. PIII 1 GHz              4      4      4      4      4
Cluster of 16 Systolas, time (s)     20     38     73    142    290
  speedup vs. PIII 1 GHz             53     56     58     60     59

- Fuzion 150 is ≈ 25 times faster than a single Systola 1024; difference in CMOS technology (0.25 µm vs. 1.0 µm)
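The "≈ 25 times" figure can be checked directly from the scan times in the table. The short sketch below only divides the tabulated times; the variable and function names are illustrative.

```python
# Scan times in seconds, copied from the table on this slide.
lengths = [256, 512, 1024, 2048, 4096]
fuzion_s = [12, 22, 42, 82, 162]
systola_s = [294, 577, 1137, 2241, 4611]
cluster_s = [20, 38, 73, 142, 290]


def ratios(slow, fast):
    """Element-wise ratio slow/fast, rounded to one decimal place."""
    return [round(s / f, 1) for s, f in zip(slow, fast)]


# How many times faster the Fuzion 150 is than a single Systola 1024.
fuzion_vs_systola = ratios(systola_s, fuzion_s)
# How many times faster the 16-board cluster is than a single board.
cluster_vs_systola = ratios(systola_s, cluster_s)

for n, f, c in zip(lengths, fuzion_vs_systola, cluster_vs_systola):
    print(f"query length {n}: Fuzion/Systola {f}x, cluster/Systola {c}x")
```

The Fuzion ratio stays between roughly 24 and 29 over all query lengths, and the 16-board cluster is about 15 to 16 times faster than one board, which is close to linear in the number of boards.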

18 Performance Evaluation
- Time comparisons for a 10 Mbase search on different parallel architectures with different query lengths
  - 4× faster than 16K-PE MasPar
  - 6× faster than Kestrel
  - 5× faster than SAMBA (special-purpose 3-board architecture)

19 Performance Evaluation
[Figure legend; + marks single-purpose hardware, * marks FPGA-based hardware]
- USparc: Sun Ultrasparc 140 MHz
- B-SYS: 470-PE ISA
- Alpha: DEC Alpha 433 MHz
- 1K MP2: 1K-PE MasPar
- Paragon: 32-node Paragon
- Decy-1: 1-board Decypher-II *
- Merc1: 1-board Mercury +
- Bcll-1: Biocellerator *
- Samba: 2-board Samba +
- 16-MP2: 16K-PE MasPar
- FDF-3: 5-Board Paracell FDF +
- Kestrel: 1-board Kestrel
- Decy-15: 15-board Decypher-II *
Source: Dahle et al., PDPTA, 1243-1249, 1999

21 Conclusions
- Demonstrated how fine-grained and hybrid parallel architectures can be applied efficiently to comparative genomics
- Significant runtime savings for full genome comparisons and database searching
- The same systems can be used to accelerate other bioinformatics applications, e.g. Hidden Markov Models

22 Comments
With hardware support, is S-W as fast as BLAST?

Time taken for the search in seconds (against Swiss-Prot DB), by sequence under test:

Search tool       ELVIS (5)   Metr (276)   Arp_arath (536)
FASTA 3.3               4.3         20.0              25.0
BLAST 2.2               1.0          4.0              10.0
SSearch (SW)            6.0        240.0             565.0
H'Ware Accl.            3.2         16.8              29.7

Comparative search speeds on a 600 MHz 21264A Alpha machine (comparable MCUPS to the Hybrid System and Fuzion 150)
Source: Shane Sturrock, SCS, 2(1), April 2002

23 Comments
Is it feasible to use S-W as the default?
- Currently offered as a default option at EBI (European Bioinformatics Institute), which handles 15K queries per month with a full implementation of S-W
- Depends on the "objectives" of the search
Just how much more accurate is S-W?
- 5-10% more "sensitive" towards divergent matches than BLAST (Shpaer et al., Genomics 38, 179-191, 1996)
- BLAST will retrieve most biologically significant similarities, but will miss a few and will include some chance similarities

24 Comparison of S-W vs. BLAST
Is there a real difference in the results? YES.
Source: Shpaer et al., Genomics 38(2), pp. 179-191, 1996

25 Comparison of S-W, FASTA, and BLAST
Note: The numbers in the table show for how many protein SF the method in the column performed better than the one in the row

26 Acknowledgements: Dr. Bertil Schmidt, Dr. Chau-Wen Tseng

27 Q&A

28 Extra Slides

29 Full Genome Comparison
- Related organisms, but Tuberculosis causes a disease: find common and different parts
- 16 × 10^6 pairwise sequence comparisons
[Figure labels: 3918 protein sequences, 1,329,298 amino acids; 4289 protein sequences, 1,359,008 amino acids.]

30 Smith-Waterman Algorithm
- Optimal local alignment of two sequences
- Performs an exhaustive search for the optimal local alignment
  - Complexity O(n × m) for sequence lengths n and m
- Based on the dynamic programming (DP) algorithm
  - Fill the DP matrix using a substitution (mutation) matrix
  - Find the maximal value (score) in the matrix
  - Trace back from the score until a 0 value is reached
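The three DP steps listed above (fill, find the maximum, trace back) can be sketched as follows. This is a minimal illustration, not the presented implementation: it assumes a simple match/mismatch score (2 / −1) in place of a real substitution matrix and a linear gap penalty of 1, and the function name `smith_waterman` is hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=1):
    """Return (best score, aligned part of a, aligned part of b).

    match/mismatch stand in for a substitution matrix; all scoring
    values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Step 1: fill the DP matrix (O(n * m) cell updates).
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # extend both sequences
                          H[i - 1][j] - gap,     # a[i-1] against a gap
                          H[i][j - 1] - gap)     # b[j-1] against a gap
            # Step 2: remember where the maximal score occurs.
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Step 3: trace back from the maximum until a 0 cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("ATCTCGTATGATG", "GTCTATCAC")
```

With the example sequences from slide 32, these assumed parameters give a score of 10 and align TCGTATGA with TC-TATCA.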

31 Smith-Waterman Algorithm
- Aligning S1 and S2 of lengths l1 and l2 using recurrences: [equations not preserved in the transcript]
- Calculate three possible ways to extend the alignment
  - by one amino acid (AA) in each sequence
  - by one AA in the first sequence, aligned with a gap in the second
  - by one AA in the second sequence, aligned with a gap in the first
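A reconstruction of the recurrences, based on the per-cell operations listed on slide 39, is given below. It assumes that a_i is the i-th character of S1, b_j the j-th character of S2, Sbt the substitution table, and that α and β are the gap penalties (commonly the cost of opening and of extending a gap); the boundary conditions are the usual ones and are likewise an assumption.

```latex
\begin{align*}
E(i,j) &= \max\{\, H(i,j-1) - \alpha,\; E(i,j-1) - \beta \,\} \\
F(i,j) &= \max\{\, H(i-1,j) - \alpha,\; F(i-1,j) - \beta \,\} \\
H(i,j) &= \max\{\, 0,\; E(i,j),\; F(i,j),\; H(i-1,j-1) + \mathrm{Sbt}(a_i, b_j) \,\}
\end{align*}
% for 1 <= i <= l_1 and 1 <= j <= l_2, with
% H(i,0) = E(i,0) = H(0,j) = F(0,j) = 0.
```

The diagonal term extends the alignment by one AA in each sequence, F(i,j) extends it by one AA of the first sequence against a gap in the second, and E(i,j) by one AA of the second sequence against a gap in the first.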

32 Smith-Waterman Algorithm
Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC, with α = 1, β = 1.
[Figure: filled DP matrix for S1 and S2; the maximum score is 10, and the traceback yields the alignment
  S1: A T C T C G T A T G A T G
  S2:     G T C - T A T C A C  ]

33 Principles of the ISA [figure not preserved]

34 Principles of the ISA [Figure label: communication register]

35 [Figure labels: ISA; interface processors north; interface processors west]

36 Instruction Systolic Array
[Figure: a stream of instructions (+, −, *) enters the array together with row selectors and column selectors.]
- Wavefront instruction execution
  - Fast accumulation operations (e.g. row sum, broadcast, ringshift)

37 Advantage of ISAs: Performing Aggregate Functions
- Row broadcast: the instruction C := C[WEST] moves through the row, copying the value of the westernmost PE (234 in the figure) into every PE.
- Row sum: the instruction C := C + C[WEST] moves through the row; the values 1, 2, 3, 4 become 1, 3, 6, 10, so the easternmost PE holds the row sum.
- Row ringshift: C := C[WEST]; C := C[EAST]
[Figure: step-by-step snapshots of the communication registers C for the three operations.]
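The row broadcast and row sum can be mimicked with a small Python simulation. It is a sketch under a simplifying assumption: the instruction wavefront reaches column c at step c, after the PE to its west has already executed, which a plain left-to-right loop reproduces. The function names are illustrative, and the ringshift is left out because the transcript does not preserve enough of its figure.

```python
def row_broadcast(row):
    """Simulate C := C[WEST] sweeping a row from west to east."""
    c = list(row)
    for col in range(1, len(c)):
        # The western neighbour has already executed, so it holds the value.
        c[col] = c[col - 1]
    return c


def row_sum(row):
    """Simulate C := C + C[WEST] sweeping a row from west to east."""
    c = list(row)
    for col in range(1, len(c)):
        # Each PE adds the running total held by its western neighbour.
        c[col] = c[col] + c[col - 1]
    return c


broadcast = row_broadcast([234, 0, 0, 0])
sums = row_sum([1, 2, 3, 4])
```

Both operations finish after a single sweep of one instruction, which is why they are cheap on an ISA: no separate reduction or routing phase is needed.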

38 Data Transfer
- In Systola 1024:
  - input of a new character (b_j) into the lower western interface processor (IP), and
  - when l1 > 2048, input of previously computed H, E, and F cells and output of H, E, and F cells
- For Fuzion 150, one new character is input via the Fuzion bus while each PE computes 16 new H-cells

39 Instruction Counts
- Instruction count (IC) to update 2 and 16 H-cells in Systola 1024 and Fuzion 150, respectively:

Operations in each PE per iteration step                          Systola   Fuzion
Get H(i−1, j), F(i−1, j), b_j, max_{i−1} from neighbor                 20       22
Compute t = max{0, H(i−1, j−1) + Sbt(a_i, b_j)}                        20      576
Compute F(i, j) = max{H(i−1, j) − α, F(i−1, j) − β}                     8      336
Compute E(i, j) = max{H(i, j−1) − α, E(i, j−1) − β}                     8      448
Compute H(i, j) = max{t, E(i, j), F(i, j)}                              8      368
Compute max_i = max{H(i, j), max_{i−1}}                                 4      184
Sum                                                                    68     1934

40 Maximum Characters/PE
- The memory per PE on Systola is 32 (16-bit) registers
  - 2 characters per PE is the maximum possible
  - (2 chars × 20 AAs per substitution row × 8 bits per substitution value = 320 bits = 20 registers)
- The memory per PE on Fuzion is 2 KB
  - the maximum is 16 chars per PE
  - restricted due to "indirect addressing" per PE

41 Indirect Address
- An addressing mode found in many processors' instruction sets, in which the instruction either contains the address of a memory location that holds the address of the operand (the "effective address") or specifies a register that holds the effective address

42 Myrinet - Overview
- Myrinet is a cost-effective, high-performance, packet-communication and switching technology that is widely used to interconnect clusters of workstations, PCs, servers, or single-board computers
- Conventional networks (e.g., Ethernet) can be used to build clusters, but do not provide the performance and features required for HPC or high-availability clustering

43 Myrinet - Characteristics
- Full-duplex 2+2 Gigabit/second data rate links, switch ports, and interface ports
- Flow control, error control, and "heartbeat" continuity monitoring on every link
- Low-latency, cut-through, crossbar switches, with monitoring for high-availability applications
- Switch networks that can scale to tens of thousands of hosts, and that can also provide alternative communication paths between hosts
- Host interfaces that execute a control program to interact directly with host processes ("OS bypass") for low-latency communication, and directly with the network to send, receive, and buffer packets

44 Query length l_q vs. number of processors: Hybrid
Query sequence length = M, number of processors in the ISA = N², assuming M = k × N:
1. k < N: each k × N subarray computes the alignment of the same query sequence with a different subject sequence
2. k ≥ N:
   - k/N = 2: load 2 chars per PE
   - k/N > 2: split the query sequence into k/2N passes and load 2N² chars in each pass

45 Query length l_q vs. number of processors: Fuzion 150
Length of query sequence = M, number of processors = 1536:
1. k × M = 1536: k alignments of the same query sequence with different subject sequences are carried out in parallel
2. k × 1536 = M: split into k passes, which requires I/O of intermediate results in each step
Data transfers can be minimized by assigning several characters (M/1536 = k) to each PE; currently 16 chars per PE is the limit
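One way to read the two cases is as a small planning function. The sketch below is an interpretation, not the presented implementation: the function name `plan_fuzion_mapping` and the returned field names are hypothetical, and it assumes that queries that do not divide 1536 evenly are handled by rounding (down for parallel alignments, up for characters per PE).

```python
import math


def plan_fuzion_mapping(query_len, num_pes=1536, max_chars_per_pe=16):
    """Sketch of how a query of length M could be laid out on the PEs.

    Rounding behaviour for non-divisible lengths is an assumption.
    """
    if query_len <= num_pes:
        # Case 1: k * M = 1536, so k copies of the query run side by side,
        # each aligned against a different subject sequence.
        return {"parallel_alignments": num_pes // query_len,
                "chars_per_pe": 1,
                "passes": 1}
    # Case 2: the query is longer than the array. Packing several characters
    # into each PE avoids extra passes up to the 16-character limit.
    needed = math.ceil(query_len / num_pes)
    return {"parallel_alignments": 1,
            "chars_per_pe": min(needed, max_chars_per_pe),
            "passes": math.ceil(needed / max_chars_per_pe)}


short_query = plan_fuzion_mapping(256)
long_query = plan_fuzion_mapping(4096)
```

Under these assumptions a 256-character query leaves room for 6 alignments in parallel, while a 4096-character query fits in one pass with 3 characters per PE.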

46 Concept of true and false hits
The following cases were distinguished:
- true positives: alignments between proteins of similar structure that fall above a given threshold (defined by the sequence alignment method)
- false positives: alignments between proteins of dissimilar structure that fall above a given threshold of the sequence alignment
- true negatives: alignments between proteins of dissimilar structure that fall below a given threshold
- false negatives: alignments between proteins of similar structure that fall below a given threshold
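The four cases can be written as a small classification function. The names are illustrative, and treating a score exactly equal to the threshold as "above" is an assumption the slide does not settle.

```python
def classify_hit(score, threshold, similar_structure):
    """Label one alignment as a true/false positive/negative.

    `similar_structure` is the ground truth (do the two proteins share a
    structure); a score equal to the threshold counts as above it, which
    is an assumption.
    """
    above = score >= threshold
    if above and similar_structure:
        return "true positive"
    if above and not similar_structure:
        return "false positive"
    if not above and not similar_structure:
        return "true negative"
    return "false negative"


def count_hits(results, threshold):
    """Tally the four categories over (score, similar_structure) pairs."""
    counts = {"true positive": 0, "false positive": 0,
              "true negative": 0, "false negative": 0}
    for score, similar in results:
        counts[classify_hit(score, threshold, similar)] += 1
    return counts


example = count_hits([(120, True), (95, False), (30, False), (40, True)], threshold=90)
```

Raising the threshold trades false positives for false negatives, which is the trade-off behind the sensitivity comparisons between S-W and BLAST.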

47 Guidelines
When to use S-W?
- if you are looking for a protein distantly related to your query sequence (e.g., you have a known protein sequence and you want to find possible distant homologues)
- if you are looking for the protein encoded in your low-quality DNA query sequence (e.g., you have a badly sequenced cDNA clone)
- if you are looking for a DNA sequence corresponding to your protein query sequence (e.g., you want to identify potential homologues of your protein in the EST databases)
When to use BLAST?
- if you are looking for close matches and you don't mind missing lower-homology sequences
- if you want a quick answer

48 Performance Evaluation of SAMBA

Query sequence length            10     30    100    300   1000   3000   10000
Samba, time (s)                  25     25     26     30     40     77     210
DEC-Alpha 150 MHz, time (s)      57    120    350   1041   3468  11510   38450
  Samba speedup                 2.3    4.8   13.5   34.7   86.7    150     183
SUN-Sparc 5 110 MHz, time (s)    95    239    746   2215   7300  24269   80300
  Samba speedup                 3.8    9.5   28.6     74    183    315     382
DEC 5000/250 40 MHz, time (s)   182    548   1407   4054  12920  41169  131193
  Samba speedup                 7.3     22     54    135    323    534     625

Source: Jamet and Lavenier, CABIOS, 12(7), 609-615, 1997
The longer the query, the better the speedup.

49 Performance Evaluation of Kestrel
[Figure legend; + marks single-purpose hardware, * marks FPGA-based hardware]
- USparc: Sun Ultrasparc 140 MHz
- B-SYS: 470-PE ISA
- Alpha: DEC Alpha 433 MHz
- 1K MP2: 1K-PE MasPar
- Paragon: 32-node Paragon
- Decy-1: 1-board Decypher-II *
- Merc1: 1-board Mercury +
- Bcll-1: Biocellerator *
- Samba: 2-board Samba +
- 16-MP2: 16K-PE MasPar
- FDF-3: 5-Board Paracell FDF +
- Kestrel: 1-board Kestrel
- Decy-15: 15-board Decypher-II *
Source: Dahle et al., PDPTA, 1243-1249, 1999

50 Performance Evaluation of Splash-2

Hardware          Specifics            MCUPS
Splash-2          Unidir; 16 boards    43,000
Splash-2          Bidir; 16 boards     34,000
Splash-2          Unidir; 1 board       3,000
Splash-2          Bidir; 1 board        2,100
Splash-1          Bidir; 746 PEs          370
SPARC 10/30 GX    gcc -O2                 1.2
VAX 6620          VMS; CC                 1.0
SPARC-1           gcc -O2                0.87
486DX-50 PC       DOS; gcc -O2           0.67

Source: Hoang, IEEE-CMM, 185-191, 1993

