Published by Dawson Tremelling. Modified about 1 year ago.

1
Improved Alignment of Protein Sequences Based on Common Parts
David Hoksza
Charles University in Prague, Department of Software Engineering, Czech Republic

2
ISBRA
Presentation Outline
- Similarity search in protein sequence databases
- Smith-Waterman algorithm
- Common parts
  - basic algorithm
  - reversed sequences
  - inexact search
- Experiments
- Conclusion

3
Similarity Measures
- compared objects: two strings of amino acids
- Hamming distance
  - sequences of equal length
  - number of non-identical positions
- edit distance
  - minimal number of insert/update/delete operations needed to convert one sequence to the other
- weighted edit distance
  - takes into account the probability of updating one letter to another
  - scoring (substitution) matrices: PAM, BLOSUM, …
  - different costs for opening and extending a gap
  - global/local alignment
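The two unweighted measures above can be sketched in a few lines. This is a minimal illustration, not code from the presentation; the function names `hamming` and `edit_distance` are chosen here for the example.

```python
def hamming(a: str, b: str) -> int:
    """Number of non-identical positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Minimal number of insert/update/delete operations turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]                           # deleting the first i letters of a
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # delete x
                           cur[j - 1] + 1,          # insert y
                           prev[j - 1] + (x != y)))  # update x to y (free if equal)
        prev = cur
    return prev[-1]


print(hamming("HGL", "HGI"))                  # 1
print(edit_distance("NPHGIIMGLAE", "HGL"))    # 8 (HGL is a subsequence, so 8 deletions)
```

The weighted variant replaces the unit costs with entries of a substitution matrix and separate gap costs, as on the following slides.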

4
Global Alignment
- Global alignment
  - aligning whole sequences
  - weighted edit distance
- Needleman-Wunsch
  - optimal alignment between 2 sequences a and b
  - distance matrix δ, gap cost σ
  - s_{i,j}: value of the optimal alignment of the prefixes of a and b of length i and j
  - s_{0,j} = j*σ, s_{i,0} = i*σ
  - s_{i,j} is the best of three cases: adding a gap to a, adding a gap to b, aligning a_i and b_j
  - s_{|a|,|b|}: value of the optimal alignment
  - O(|a||b|)
- Example: NPHGIIMGLAE / --HG--LGL, BLOSUM 62, gap cost -1
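A minimal sketch of the Needleman-Wunsch recurrence described above, written as score maximisation with a negative gap cost σ. To stay self-contained it uses a hypothetical `toy_delta` (+1 match, -1 mismatch) in place of BLOSUM 62, so the numbers do not correspond to the slide's example.

```python
def toy_delta(x: str, y: str) -> int:
    """Stand-in for a substitution matrix such as BLOSUM 62 (assumption)."""
    return 1 if x == y else -1


def needleman_wunsch(a: str, b: str, delta=toy_delta, sigma: int = -1) -> int:
    """Value of the optimal global alignment of a and b, in O(|a||b|) time."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * sigma                 # prefix of a against gaps only
    for j in range(1, m + 1):
        s[0][j] = j * sigma                 # prefix of b against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i][j - 1] + sigma,        # adding a gap to a (b_j faces a gap)
                s[i - 1][j] + sigma,        # adding a gap to b (a_i faces a gap)
                s[i - 1][j - 1] + delta(a[i - 1], b[j - 1]),  # align a_i and b_j
            )
    return s[n][m]


print(needleman_wunsch("NPHGIIMGLAE", "HGL"))  # -5: three matches, eight gaps
```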

5
Local Alignment
- Local alignment
  - best global alignment of all pairs of subsequences of a and b
- Smith-Waterman
  - modification of Needleman-Wunsch
  - allows a "free ride" from the start by incorporating a zero value
  - s_{0,j} = 0, s_{i,0} = 0
  - max(s_{i,j}): value of the optimal alignment
- gap extending: σ, gap opening: ρ
- Example: NPHGIIMGLAE / HGL, BLOSUM 62, gap cost -11
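The local variant with separate gap opening (ρ) and gap extending (σ) costs can be sketched with three matrices, here called s, h and v. This is an illustrative sketch under two assumptions: `toy_delta` stands in for BLOSUM 62, and the first position of a gap costs ρ while each further position costs σ (other conventions charge ρ + σ for the first position).

```python
def toy_delta(x: str, y: str) -> int:
    """Stand-in for a substitution matrix such as BLOSUM 62 (assumption)."""
    return 1 if x == y else -1


def smith_waterman(a: str, b: str, delta=toy_delta,
                   rho: int = -11, sigma: int = -1) -> int:
    """Value of the optimal local alignment of a and b with affine gaps."""
    n, m = len(a), len(b)
    neg = float("-inf")
    s = [[0] * (m + 1) for _ in range(n + 1)]    # zero borders: the "free ride"
    h = [[neg] * (m + 1) for _ in range(n + 1)]  # alignments ending in a gap in a
    v = [[neg] * (m + 1) for _ in range(n + 1)]  # alignments ending in a gap in b
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i][j] = max(s[i][j - 1] + rho, h[i][j - 1] + sigma)  # open / extend
            v[i][j] = max(s[i - 1][j] + rho, v[i - 1][j] + sigma)  # open / extend
            s[i][j] = max(0,                                       # restart here
                          s[i - 1][j - 1] + delta(a[i - 1], b[j - 1]),
                          h[i][j], v[i][j])
            best = max(best, s[i][j])
    return best


print(smith_waterman("NPHGIIMGLAE", "HGL"))  # 2: the exact local match "HG" (or "GL")
```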

6
Speeding-up Database Search
- non-rigorous search
  - heuristic approaches trading off accuracy for speed
  - BLAST, FASTA
- rigorous search
  - indexing
    - weighted edit distance is not a metric in general → MAMs (metric access methods) not applicable
    - turning the distance into a metric: limited to q-grams
  - parallelism
    - running more alignments concurrently: MPSrch
    - parallelizing the distance computation itself: FPGA (field-programmable gate arrays), instructions for parallelism

7
Common Parts of Alignment Matrices
1. align s_i with the query sequence
2. replace s_i with s_{i+1}
3. start the alignment from the (n+1)st row, where n is the length of the prefix shared by s_i and s_{i+1}
- do the same with the h and v matrices
- the algorithm stays intact
- pre-step: sorting
- prefix ratio (PR): speed-up
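The three steps above can be sketched as follows. This is not the author's implementation: for brevity it uses a single gap cost (so only the s matrix, no h and v), a hypothetical `toy_delta` instead of BLOSUM 62, and it assumes the prefix ratio is the fraction of matrix rows reused from the preceding sequence in sorted order.

```python
def toy_delta(x: str, y: str) -> int:
    """Stand-in for a substitution matrix such as BLOSUM 62 (assumption)."""
    return 1 if x == y else -1


def common_prefix_len(x: str, y: str) -> int:
    n = 0
    for cx, cy in zip(x, y):
        if cx != cy:
            break
        n += 1
    return n


def search_with_common_prefixes(query, database, delta=toy_delta, sigma=-1):
    """Local alignment score of every database sequence against the query,
    reusing the matrix rows that belong to a prefix shared with the
    previously aligned sequence. Returns (scores, prefix_ratio)."""
    m = len(query)
    rows = [[0] * (m + 1)]   # rows[i]: matrix row after i residues of the sequence
    row_best = [0]           # row_best[i]: highest cell value in rows[0..i]
    prev, scores = "", {}
    reused = total = 0
    for seq in sorted(database):             # pre-step: sorting
        n = common_prefix_len(prev, seq)
        del rows[n + 1:]                     # keep the n shared rows (plus row 0)
        del row_best[n + 1:]
        for i in range(n + 1, len(seq) + 1):  # start from the (n+1)st row
            up, cur = rows[i - 1], [0] * (m + 1)
            for j in range(1, m + 1):
                cur[j] = max(0, up[j] + sigma, cur[j - 1] + sigma,
                             up[j - 1] + delta(seq[i - 1], query[j - 1]))
            rows.append(cur)
            row_best.append(max(row_best[-1], max(cur)))
        scores[seq] = row_best[len(seq)]
        reused += n
        total += len(seq)
        prev = seq
    return scores, (reused / total if total else 0.0)


scores, pr = search_with_common_prefixes("NPHGIIMGLAE",
                                         ["HGLAE", "HGIIM", "NPHG", "HGL"])
print(scores, pr)
```

Because the rows of a shared prefix are identical for both sequences, reusing them changes only the amount of work, not the resulting scores.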

8
Reversed Sequences
- the score of an alignment is independent of the direction of the alignment
- possibility of aligning according to suffixes (prefixes of reversed sequences)
- division of the database into 2 groups (prefixes, suffixes) by a greedy algorithm:
  1. building stage: divide a given percentage of the database randomly and the rest so that PR increases in every step
  2. shifting stage: move a random sequence to the opposite group if it would increase the overall PR
  - repeat step 2 n times
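A rough sketch of the two-stage greedy division described above. The slide does not define PR exactly or say how ties are broken, so this example makes assumptions: the quantity being increased is the number of residues shared with the preceding sequence in sorted order (which is proportional to PR because the total length is fixed), ties go to the prefix group, and the names `shared_residues`, `overall_shared` and `divide_database` are hypothetical. It recomputes the objective from scratch at every step, which is simple but slow.

```python
import random


def shared_residues(seqs):
    """Residues covered by a prefix shared with the previous sequence in sorted order."""
    ordered = sorted(seqs)
    total = 0
    for x, y in zip(ordered, ordered[1:]):
        for cx, cy in zip(x, y):
            if cx != cy:
                break
            total += 1
    return total


def overall_shared(seqs, in_rev):
    """Shared residues of the prefix group plus those of the reversed (suffix) group."""
    fwd = [s for s, r in zip(seqs, in_rev) if not r]
    rev = [s[::-1] for s, r in zip(seqs, in_rev) if r]
    return shared_residues(fwd) + shared_residues(rev)


def divide_database(database, random_fraction=0.1, shifts=100, seed=0):
    """Return (seqs, in_rev); in_rev[i] is True if seqs[i] goes to the suffix group."""
    rng = random.Random(seed)
    seqs = list(database)
    rng.shuffle(seqs)
    k = int(len(seqs) * random_fraction)
    # 1. building stage: a given fraction is assigned randomly ...
    in_rev = [rng.random() < 0.5 for _ in range(k)]
    # ... and each remaining sequence joins the group where it shares more.
    for i in range(k, len(seqs)):
        as_fwd = overall_shared(seqs[:i + 1], in_rev + [False])
        as_rev = overall_shared(seqs[:i + 1], in_rev + [True])
        in_rev.append(as_rev > as_fwd)
    # 2. shifting stage: move a random sequence only if the objective increases.
    for _ in range(shifts):
        if not seqs:
            break
        i = rng.randrange(len(seqs))
        before = overall_shared(seqs, in_rev)
        in_rev[i] = not in_rev[i]
        if overall_shared(seqs, in_rev) <= before:
            in_rev[i] = not in_rev[i]        # no gain: undo the move
    return seqs, in_rev


seqs, in_rev = divide_database(["AAAB", "AAAC", "XYZZ", "QRZZ"])
print(list(zip(seqs, in_rev)), overall_shared(seqs, in_rev))
```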

9
Inexact Search
- bigger database (#sequences) → higher PR
- split sequences → database size increases proportionally to the number of splits
- → inaccuracy: sequences whose alignment spreads over a split might no longer be in the result
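A small illustration of why splitting makes the search inexact. It is a sketch under assumptions: `chop` splits a sequence into nearly equal consecutive pieces (the presentation does not say how sequences are split), and `local_score` is a simplified Smith-Waterman with a single gap cost and a hypothetical `toy_delta` instead of BLOSUM 62.

```python
def toy_delta(x: str, y: str) -> int:
    """Stand-in for a substitution matrix such as BLOSUM 62 (assumption)."""
    return 1 if x == y else -1


def local_score(a: str, b: str, delta=toy_delta, sigma: int = -1) -> int:
    """Simplified Smith-Waterman score with a single gap cost."""
    prev, best = [0] * (len(b) + 1), 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(max(0, prev[j] + sigma, cur[j - 1] + sigma,
                           prev[j - 1] + delta(x, y)))
        best = max(best, max(cur))
        prev = cur
    return best


def chop(seq: str, parts: int) -> list:
    """Split seq into at most `parts` consecutive pieces of nearly equal length."""
    size = max(1, -(-len(seq) // parts))     # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]


whole, query = "NPHGIIMGLAE", "IIMG"
pieces = chop(whole, 2)                                  # ['NPHGII', 'MGLAE']
print(local_score(query, whole))                         # 4: "IIMG" matches exactly
print(max(local_score(query, p) for p in pieces))        # 2: the match spans the split
```

The best alignment of the query lies across the split point, so no single piece reaches the score of the whole sequence and the sequence could drop out of the result.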

10
Experimental Results
- UniProt DB
  - max. sequence length 3000 (99.9% of UniProt)
  - random subsets: 1,000, 5,000, …
  - semantically motivated subsets: archaea, bacteria, fungi, human, invertebrates, mammals, plants, rodents, vertebrates, viruses
- Testing of the prefix ratio of
  - the basic solution
  - reversed sequences
  - chopped sequences

11
Experiments - Prefix Ratio of Random Subsets and Taxonomic Divisions

12
Experiments – Reversed Sequences
Plot series: after the building stage, after the shifting stage, without reversed sequences

13
Experiments – Chopped Sequences

14
Conclusion
- We have proposed a simple method for speeding up the database search of protein sequences by using common prefixes and suffixes
  - easy implementation with current methods
  - rigorous and non-rigorous versions of the algorithm
- We implemented a modification of the Smith-Waterman algorithm
- Experimental results
  - we have shown up to 20% speed-up with the rigorous version of the algorithm
