InCoB 2007 - August 30, 2007 - HKUST: "Speedup Bioinformatics Applications on Multicore-based Processors using Vectorizing & Multithreading Strategies", King Mongkut's Institute of Technology, Ladkrabang.

Presentation transcript:

InCoB 2007, August 30, 2007, HKUST. "Speedup Bioinformatics Applications on Multicore-based Processors using Vectorizing & Multithreading Strategies". King Mongkut's Institute of Technology, Ladkrabang, Thailand; National Center for Genetic Engineering and Biotechnology, Thailand. Dr. Surin Kittitornkun, Dr. Sissades Tongsima, Kridsadakorn Chaichoompu.

Outline
• Introduction
• Case Study
• Existing works
• Speedup of our approach
• Comparison
• Discussion
• Our strategies
• Limitation
• Conclusion

Motivation
• New, modern processors are being launched (dual-core CPU, quad-core CPU).
• How can we make use of these new technologies?

Motivation [2]
• What is the difference between old and new CPUs?
• Dual-core: max. speedup ~2x. Quad-core: max. speedup ~4x.

Problems
• Is old sequential software still used? Yes, especially scientific and bioinformatics tools.
• Why do scientists still use it? Mostly they care about novel algorithms and knowledge; they don't care about speed.
• Why don't we use a PC cluster? It is very expensive and consumes much more electric power. You don't need a PC cluster if you just want to run a small program for searching, matching, or grouping data.

Our Contribution
• The hardware has changed, so old sequential software should change as well. To harness the power of the new multicore architecture, certain compiler techniques must be considered.
• Using the popular ClustalW application as our case study, optimization and multithreading techniques were applied to speed up ClustalW.

Case Study: ClustalW. ClustalW is a general-purpose multiple alignment program for DNA or proteins.

ClustalW example

Sequences: S1 = ALSK, S2 = TNSD, S3 = NASK, S4 = NTSD

Multiple alignment steps:
1. All pairwise alignments → distance matrix
2. Neighbor joining → guide tree
3. Progressive alignment: align S1 with S3; align S2 with S4; then align (S1, S3) with (S2, S4)

Resulting multiple alignment:
-ALSK
-TNSD
NA-SK
NT-SD
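The pipeline above starts from all pairwise distances. A minimal C sketch of that stage, using a hypothetical mismatch-fraction distance in place of ClustalW's real pairwise alignment score:

```c
#define NSEQ 4
#define SLEN 4

/* Toy distance: fraction of mismatched positions. Real ClustalW
 * derives distances from full pairwise dynamic-programming
 * alignments; this placeholder stands in for that step. */
double pair_distance(const char *a, const char *b) {
    int mismatch = 0;
    for (int i = 0; i < SLEN; i++)
        if (a[i] != b[i]) mismatch++;
    return (double)mismatch / SLEN;
}

/* Fill the upper triangle: NSEQ*(NSEQ-1)/2 independent pairs. */
void distance_matrix(const char *seq[NSEQ], double d[NSEQ][NSEQ]) {
    for (int i = 0; i < NSEQ; i++)
        for (int j = i + 1; j < NSEQ; j++)
            d[i][j] = pair_distance(seq[i], seq[j]);
}
```

Because every pair is independent, this loop nest is a natural target for the multithreading strategy described later in the talk.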

Existing works
• ClustalW-MPI: ClustalW analysis using distributed and parallel computing. K.B. Li, Bioinformatics 19, 2003.
• Parallel MSA: Parallel Multiple Sequence Alignment with Dynamic Scheduling. J. Luo, I. Ahmad, M. Ahmed and R. Paul, ITCC'05.
• SGI: Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL. D. Mikhailov, Haruna C., and R. Gomperts, SGI ChemBio.

Speedup of our approach

Data set: protein sequences (from NCBI), 1000 amino acids. Run time: from 3 h 40 m down to 1 h 43 m.

Elapsed times (ms):

Running mode*   Distance Matrix   Neighbor Joining   Progressive Alignment   Overall speedup
I               11,918,672        932,718            333,110                 -
II              10,387,046        881,125            …,016                   …
III             9,656,750         880,969            …,985                   …
IV              7,009,875         511,047            …,984                   …
V               5,900,891         473,359            …,188                   …
VI              5,472,407         474,109            …,672                   …

*Note: running modes are defined as follows: (I) ClustalW without optimization; (II) ClustalW with optimization; (III) ClustalW with optimization and our assist; (IV) MT-ClustalW without optimization; (V) MT-ClustalW with optimization; (VI) MT-ClustalW with optimization and our assist.

ClustalW: speedup of the optimized versions of ClustalW as a function of the number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids.

Multithreaded ClustalW: speedup of the optimized versions of MT-ClustalW as a function of the number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids.

Comparison

                ClustalW-MPI   Parallel MSA   SGI             ClustalW-MTV
Machine         PC cluster     …              shared memory   single PC
Processors      2              2              2               2
Speedup         1.75x          1.8x           …               2.25x

• Why is the speedup over 2x? Because of the special unit in the new CPU.
• Does the special unit normally work with common software? No, we have to activate it.

Speedup > 2x for dual-CPU? [1]

Amdahl's Law: the overall speedup S is S = 1 / ((1 − P) + P / N), where P is the fraction of the program that can be parallelized and N is the number of processors.

Speedup > 2x for dual-CPU? [2]

Data set: 800 sequences, 1000 amino acids. Measured speedups: 1.21 and 1.70.
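Amdahl's law ties the two slides together. The parallel fraction below is illustrative (chosen to reproduce the 1.70 figure), not a value measured in the talk:

```latex
% Amdahl's law: speedup S from parallelizing a fraction P over N processors
S \;=\; \frac{1}{(1 - P) + \dfrac{P}{N}}

% Illustration: N = 2 and P \approx 0.82 give
% S \approx \frac{1}{0.18 + 0.41} \approx 1.70.
% Threading alone can therefore never exceed 2x on a dual-core;
% any extra gain must come from raising the per-core rate,
% e.g. by vectorization.
```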

Our strategies
• Step 1: Analyzing and profiling, to find the software structure and where the bottleneck is.
• Step 2: Applying the methodologies: multithreading & vectorizing (one of the optimization methods).
• Step 3: Validating: compare the result with the original one to make sure the result is unchanged.

Strategy: Multithreading

The proposed multithreading strategy:
• Improve the bottleneck of the software, i.e. the non-threaded part
• Raise the throughput of the program by applying the multithreading strategy
• Reduce the overhead of thread creation

Profile the software

Profiled by the Intel Thread Profiler. The three phases shown are the distance matrix, neighbor joining, and progressive alignment.

Implementation

Apply the thread library to this loop (the profiled bottleneck loop).
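A minimal sketch of applying a thread library to such a loop, assuming POSIX threads and a hypothetical do_pair work function (not ClustalW's actual code): one thread is created per loop iteration.

```c
#include <pthread.h>

#define NPAIRS 12

double pair_result[NPAIRS];

/* Hypothetical per-pair work; stands in for one pairwise alignment. */
void *do_pair(void *arg) {
    long i = (long)arg;
    pair_result[i] = i * 0.5;    /* placeholder computation */
    return 0;
}

/* Naive threading of the loop: one thread per iteration. Correct,
 * but pays thread-creation overhead NPAIRS times. */
void run_thread_per_pair(void) {
    pthread_t tid[NPAIRS];
    for (long i = 0; i < NPAIRS; i++)
        pthread_create(&tid[i], 0, do_pair, (void *)i);
    for (int i = 0; i < NPAIRS; i++)
        pthread_join(tid[i], 0);
}
```

This version motivates the trick on the next slide: the per-iteration creation cost is pure overhead when the work items are small.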

Trick: Reduce Thread Creation Overhead

Only 4 threads (T1–T4) are created, and the 12 parameter sets (P1–P12) are distributed among them, instead of creating a new thread for every parameter set.
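The trick can be sketched as follows, again with POSIX threads and a placeholder work function: only four threads are created, and each one loops over a contiguous chunk of the twelve parameter sets, so creation overhead is paid 4 times instead of 12.

```c
#include <pthread.h>

#define NTHREADS 4
#define NTASKS   12

double task_result[NTASKS];

struct chunk { int begin, end; };   /* half-open range of tasks */

/* Each thread handles several parameter sets (P1..P12 over T1..T4). */
void *chunk_worker(void *arg) {
    const struct chunk *c = arg;
    for (int i = c->begin; i < c->end; i++)
        task_result[i] = i * 0.5;   /* placeholder per-task work */
    return 0;
}

/* Thread-creation overhead is paid NTHREADS times, not NTASKS times. */
void run_chunked(void) {
    pthread_t tid[NTHREADS];
    struct chunk ch[NTHREADS];
    int per = NTASKS / NTHREADS;    /* 3 tasks per thread */
    for (int t = 0; t < NTHREADS; t++) {
        ch[t].begin = t * per;
        ch[t].end = (t + 1) * per;
        pthread_create(&tid[t], 0, chunk_worker, &ch[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], 0);
}
```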

Strategy: Vectorizing

Proposed optimizing and vectorizing methodology:
• Find the frequently used functions in the program
• Apply the loop-optimizing methodologies
• Use the advantages of the Intel C++ Compiler to optimize the code, and also enable the vectorizing option
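A small illustration (not ClustalW code) of the kind of loop an auto-vectorizing compiler such as the Intel C++ Compiler can map onto the CPU's SIMD unit once the vectorization option is enabled:

```c
#include <stddef.h>

/* Contiguous, stride-1 accesses with no aliasing (`restrict`) let the
 * compiler's vectorizer process several elements per instruction. */
int dot_score(const int *restrict a, const int *restrict b, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```

The loop behaves identically whether or not it is vectorized; the compiler option only changes how fast it runs, which is why validation against the original output (Step 3 above) is still required.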

Frequently used functions (profiled by Intel VTune)

Function       Clockticks (%)   Methodology*
diff           …                A, B
prfscore       …                C
forward_pass   …                …
calc_score     …                D
reverse_pass   …                A
pdiff          …                …

*Note: A is loop reversal, B is loop fission, C is type casting, and D is procedure-call reduction.
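As an illustration of methodology D (procedure-call reduction), with a hypothetical scoring helper rather than ClustalW's actual prfscore/calc_score:

```c
/* A tiny scoring helper called from the innermost loop; for bodies
 * this small, the call overhead can dominate the useful work. */
int cell_score(int x) { return x * x; }

int total_with_calls(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += cell_score(a[i]);     /* one call per element */
    return s;
}

/* Procedure-call reduction: expand the body in place (or let the
 * compiler inline it), removing the per-iteration call. */
int total_call_reduced(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * a[i];          /* same computation, no call */
    return s;
}
```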

Loop Reversal

That is, to run a loop backward. Reversal of a for loop is legal whenever the result does not depend on the order of the index set, i.e. the loop carries no dependence between iterations.
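A minimal example of loop reversal on a dependence-free loop:

```c
/* Forward traversal. */
long sum_forward(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Reversed traversal (methodology A): legal here because the
 * iterations are independent; counting down to zero can give a
 * cheaper loop-exit test on some targets. */
long sum_reversed(const int *a, int n) {
    long s = 0;
    for (int i = n - 1; i >= 0; i--)
        s += a[i];
    return s;
}
```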

Loop Fission

A single loop can be broken into two or more smaller loops. For example, loop fission can move a block of conditionally executed statements out into its own loop.
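A minimal example of loop fission, splitting one loop over two independent updates into two simple loops:

```c
/* Before fission: one loop updates two independent arrays. */
void fused(const int *a, int *b, int *c, int n) {
    for (int i = 0; i < n; i++) {
        b[i] = a[i] * 2;
        c[i] = a[i] + 1;
    }
}

/* After fission (methodology B): two simple loops; each is a better
 * candidate for vectorization, and a conditional block could likewise
 * be split off into its own loop. */
void fissioned(const int *a, int *b, int *c, int n) {
    for (int i = 0; i < n; i++)
        b[i] = a[i] * 2;
    for (int i = 0; i < n; i++)
        c[i] = a[i] + 1;
}
```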

Limitation
• Available compilers and programming languages:
  C/C++ → Intel C++ Compiler (Windows, Linux, Mac)
  Fortran → Intel Fortran Compiler (Windows, Linux, Mac)
• Available processors: CPUs with Hyper-Threading technology or newer (Intel, AMD)

Conclusion
• A generic compiling strategy to assist the compiler in improving the performance of bioinformatics applications written in C/C++
• Proposed framework: multithreading and vectorizing strategies
• Higher speedup by taking advantage of multicore architecture technology
• The proposed optimization can be more appropriate than parallelization on a small cluster computer

Questions? Thank you