
InCoB2007 - August 30, 2007 - HKUST “Speedup Bioinformatics Applications on Multicore-based Processor using Vectorizing & Multithreading Strategies”


1 InCoB2007 - August 30, 2007 - HKUST
“Speedup Bioinformatics Applications on Multicore-based Processor using Vectorizing & Multithreading Strategies”
King Mongkut’s Institute of Technology, Ladkrabang, Thailand
National Center for Genetic Engineering and Biotechnology, Thailand
Dr. Surin Kittitornkun, Dr. Sissades Tongsima, Kridsadakorn Chaichoompu
kridsadakorn.cha@biotec.or.th

2 Outline
• Introduction
• Case Study
• Existing works
• Speedup of our approach
• Comparison
• Discussion
• Our strategies
• Limitation
• Conclusion

3 Motivation
• New processors keep being launched
• How can we make use of the new technologies?
[Figure: dual-core CPU and quad-core CPU]

4 Motivation [2]
• What is the difference between old and new CPUs?
Dual-core: max. speedup ~2x
Quad-core: max. speedup ~4x

5 Problems
• Is old sequential software still used? Yes, especially scientific and bioinformatics tools.
• Why do scientists still use it? They mostly care about novel algorithms and knowledge, not about speed.
• Why not use a PC cluster? It is very expensive and consumes much more electric power. A PC cluster is not needed to run a small program for searching, matching, or grouping data.

6 Our Contribution
• The hardware has changed, so old sequential software should change too. To harness the power of the new multicore architecture, certain compiler techniques must be considered.
• Using the popular ClustalW application as our case study, optimization and multithreading techniques were applied to speed up ClustalW.

7 Case Study: ClustalW
ClustalW is a general-purpose multiple sequence alignment program for DNA or proteins.

8 ClustalW example

Sequences:
S1  ALSK
S2  TNSD
S3  NASK
S4  NTSD

All pairwise alignments, giving the distance matrix:
      S1  S2  S3  S4
S1    0   9   4   7
S2        0   8   3
S3            0   7
S4                0

Neighbor joining, giving the multiple alignment steps:
1. Align S1 with S3
2. Align S2 with S4
3. Align (S1, S3) with (S2, S4)

Intermediate alignments:
-ALSK      -TNSD
NA-SK      NT-SD

Multiple alignment:
-ALSK
-TNSD
NA-SK
NT-SD

9 Existing works
• ClustalW-MPI: ClustalW analysis using distributed and parallel computing. K.B. Li, Bioinformatics 19, 2003
• Parallel MSA: Parallel Multiple Sequence Alignment with Dynamic Scheduling. J. Luo, I. Ahmad, M. Ahmed and R. Paul, ITCC’05
• SGI: Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL. D. Mikhailov, Haruna C., and R. Gomperts, SGI ChemBio

10 Speedup of our approach

Test data: 800 sequences, 1000 amino acids. Elapsed times in ms.

Running mode*  Distance Matrix  Neighbor Joining  Progressive Alignment  Overall speedup
I              11,918,672       932,718           333,110                -
II             10,387,046       881,125           338,016                1.14
III            9,656,750        880,969           327,985                1.21
IV             7,009,875        511,047           252,984                1.70
V              5,900,891        473,359           253,188                1.98
VI             5,472,407        474,109           244,672                2.12

*Note: The running modes are defined as follows:
(I) ClustalW without optimization
(II) ClustalW with optimization
(III) ClustalW with optimization and our assist
(IV) MT-ClustalW without optimization
(V) MT-ClustalW with optimization
(VI) MT-ClustalW with optimization and our assist

Data set: protein sequences from NCBI
Run time: from 3 h 40 m down to 1 h 43 m.
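The overall speedup column can be checked against the per-stage times. The sketch below assumes that the overall time is simply the sum of the three stages, which the slide does not state; the type and function names are hypothetical.

```c
/* Per-stage elapsed times in milliseconds, as listed in the table. */
typedef struct {
    long distance_matrix_ms;
    long neighbor_joining_ms;
    long progressive_alignment_ms;
} stage_times;

/* Assumption: total run time is the sum of the three stages. */
long total_ms(stage_times t) {
    return t.distance_matrix_ms + t.neighbor_joining_ms
         + t.progressive_alignment_ms;
}

/* Speedup of an optimized run relative to a baseline run. */
double overall_speedup(stage_times baseline, stage_times optimized) {
    return (double)total_ms(baseline) / (double)total_ms(optimized);
}
```

With the mode I and mode VI rows, the totals are 13,184,500 ms (about 3 h 40 m) and 6,191,188 ms (about 1 h 43 m), a ratio of roughly 2.13. That is in line with the reported 2.12; the small difference may come from rounding or from time spent outside the three stages.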

11 ClustalW
[Figure] Speedup of the optimized versions of ClustalW as a function of the number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids.

12 Multithreaded ClustalW
[Figure] Speedup of the optimized versions of MT-ClustalW as a function of the number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids.

13 Comparison
[Table, values as extracted] ClustalW-MPIParallel MSASGIClustalW-MTV
Number of sequences 50080600
Sequence length 1100289-399390400
Machine PC Cluster Single PC Shared memory
Processors 2 222
Speedup 1.75x1.8x 2.25x
• Why is the speedup over 2x? Because of the special unit in the new CPU.
• Does the special unit normally work with common software? No, we have to activate it.

14 Speedup > 2x for dual-CPU? [1]
Amdahl’s Law
[Formula not recoverable from the slide] S: speedup
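For reference, Amdahl's Law in its usual textbook form is S = 1 / ((1 - P) + P / N). The symbols P (fraction of the run time that can be parallelized) and N (number of cores) are assumptions here; the slide only names S. A minimal sketch with a hypothetical function name:

```c
/* Amdahl's Law: S = 1 / ((1 - P) + P / N).
 * parallel_fraction (P) and num_cores (N) are assumed symbol names. */
double amdahl_speedup(double parallel_fraction, int num_cores) {
    return 1.0 / ((1.0 - parallel_fraction)
                  + parallel_fraction / (double)num_cores);
}
```

Even with P = 1, two cores give at most 2x under this formula. A measured speedup above 2x on two cores therefore has to include a gain that is not from threading alone, which is presumably the role of the "special unit" (likely the SIMD unit targeted by vectorizing).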

15 Speedup > 2x for dual-CPU? [2]
[Figure annotations: Speedup 1.21; Speedup 1.70]
Data set: 800 sequences, 1000 amino acids

16 Our strategies
• Step 1: Analyzing and profiling. Find the software structure and where the bottleneck is.
• Step 2: Applying the methodologies. Multithreading and vectorizing (one of the optimization methods).
• Step 3: Validating. Compare the result with the original one to make sure the result is unchanged.

17 Strategy: Multithreading
• The proposed multithreading strategy: remove the bottleneck of the software, which is the non-threaded part
• Raise the throughput of the program by applying the multithreading strategy
• Reduce the overhead of thread creation

18 Profile the software
[Figure: Intel Thread Profiler output covering the distance matrix, neighbor joining, and progressive alignment stages]
Profiled by Intel Thread Profiler

19 Implementation
[Code screenshot] Apply the thread library to this loop

20 Trick: Reduce Thread Creation Overhead
[Figure: 4 threads (T1 to T4) above a grid of parameters in rows P1-P4, P5-P8, P9-P12]
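One reading of the diagram is that four threads are created once and each thread works through several parameter sets (for example P1, P5, P9), instead of one thread being created per parameter set. The sketch below illustrates that idea with POSIX threads. It is not the authors' code: `worker_args`, `worker`, `run_tasks`, and the squaring stand-in for the real work are all hypothetical.

```c
#include <pthread.h>

#define NUM_THREADS 4

/* Hypothetical per-thread context; the real ClustalW data differs. */
typedef struct {
    int thread_id;      /* 0 .. NUM_THREADS-1 */
    int num_tasks;      /* total number of parameter sets */
    const int *params;  /* one input per parameter set */
    long *results;      /* one output slot per parameter set */
} worker_args;

/* Each thread handles tasks thread_id, thread_id + 4, thread_id + 8, ... */
static void *worker(void *arg) {
    worker_args *w = (worker_args *)arg;
    for (int i = w->thread_id; i < w->num_tasks; i += NUM_THREADS) {
        /* Stand-in for the real work, e.g. one pairwise alignment. */
        w->results[i] = (long)w->params[i] * w->params[i];
    }
    return NULL;
}

/* Creates NUM_THREADS threads once, regardless of num_tasks.
 * Returns 0 on success, -1 if a thread could not be created. */
int run_tasks(const int *params, long *results, int num_tasks) {
    pthread_t threads[NUM_THREADS];
    worker_args args[NUM_THREADS];
    int started = 0;
    int status = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        args[t].thread_id = t;
        args[t].num_tasks = num_tasks;
        args[t].params = params;
        args[t].results = results;
        if (pthread_create(&threads[t], NULL, worker, &args[t]) != 0) {
            status = -1;
            break;
        }
        started++;
    }
    /* Wait only for the threads that were actually started. */
    for (int t = 0; t < started; t++) {
        pthread_join(threads[t], NULL);
    }
    return status;
}
```

With 12 parameter sets this costs 4 thread creations rather than 12. Each thread writes only to its own result slots, so no locking is needed in this sketch.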

21 Strategy: Vectorizing
• Proposed optimizing and vectorizing methodology:
  - Find the frequently used functions in the program
  - Apply the loop optimization methodologies
  - Use the Intel C++ Compiler to optimize the code, with the vectorizing option enabled

22 Frequently used functions

Function       Clockticks (%)  Methodology*
diff           33.36           A, B
prfscore       15.93           C
forward_pass   14.91           -
calc_score     12.93           D
reverse_pass   11.45           A
pdiff          5.85            -

*Note: A is loop reversal, B is loop fission, C is type casting, and D is procedure call reduction.
Profiled by Intel VTune

23 Loop Reversal
• Loop reversal runs a loop backward. Reversal is legal when the result does not depend on the order in which the index set is traversed, that is, when the loop has no loop-carried dependences.
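A minimal sketch of loop reversal on a simple integer sum, where iteration order does not affect the result. The function names are hypothetical and this is not taken from the ClustalW source.

```c
#include <stddef.h>

/* Forward traversal: i runs from 0 up to n-1. */
long sum_forward(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}

/* Reversed traversal: i runs from n-1 down to 0.
 * The loop test compares against zero, which is the usual
 * motivation for reversing a counting loop. */
long sum_reversed(const int *a, size_t n) {
    long s = 0;
    for (size_t i = n; i-- > 0; ) {
        s += a[i];
    }
    return s;
}
```

Both functions return the same value because integer addition here is independent of order. A loop in which iteration i reads a value written by iteration i-1 could not be reversed this way.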

24 Loop Fission
• A single loop can be broken into two or more smaller loops. Loop fission can break up the block of conditionally executed statements.
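A minimal sketch of loop fission, assuming the three arrays do not overlap. The function names and the scale/clamp computation are hypothetical, chosen only to show a straight-line statement being separated from a conditional one.

```c
#include <stddef.h>

/* Before fission: one loop mixes a straight-line statement
 * with a conditionally executed one. */
void scale_and_clamp_fused(const int *in, int *scaled, int *clamped,
                           size_t n) {
    for (size_t i = 0; i < n; i++) {
        scaled[i] = in[i] * 2;
        if (in[i] < 0) {
            clamped[i] = 0;
        } else {
            clamped[i] = in[i];
        }
    }
}

/* After fission: the branch-free statement gets its own loop,
 * and the conditional code is isolated in a second loop. */
void scale_and_clamp_split(const int *in, int *scaled, int *clamped,
                           size_t n) {
    for (size_t i = 0; i < n; i++) {
        scaled[i] = in[i] * 2;
    }
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0) {
            clamped[i] = 0;
        } else {
            clamped[i] = in[i];
        }
    }
}
```

The split is valid here because neither statement reads what the other writes. The first loop of the split version has a simple body with no branch, which is generally an easier candidate for a vectorizing compiler.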

25 Limitation
• Available compilers and programming languages
  - C/C++: Intel C++ compiler (Windows, Linux, Mac)
  - Fortran: Intel Fortran compiler (Windows, Linux, Mac)
• Available processors
  - CPU with Hyper-Threading technology or above (Intel, AMD)

26 Conclusion
• Generic compiling strategy to assist the compiler in improving the performance of bioinformatics applications written in C/C++
• Proposed framework: multithreading and vectorizing strategies
• Higher speedup by taking advantage of multicore architecture technology
• The proposed optimization could be more appropriate than parallelization on a small cluster computer

27 Questions? Thank you

