
1 Last lecture summary

2 Identity vs. similarity, homology vs. similarity, gap penalty
Affine gap penalty.
High gap penalty: fewer gaps; use when investigating related sequences.
Low gap penalty: more gaps and larger gaps; use for distantly related sequences.

3 BLOSUM
Blocks: multiple alignments of ungapped segments that are highly conserved. BLOSUM focuses on substitution patterns only within blocks.
BLOSUM62: what does the 62 mean? Sequences sharing no more than 62% identity were used to calculate the BLOSUM62 matrix.
BLOSUM vs. PAM: BLOSUM matrices are based on observed alignments. The BLOSUM numbering system runs in the reverse order to the PAM numbering system.

4 Selecting an Appropriate Matrix
Matrix     Best use                                               Similarity (%)
PAM40      Short, highly similar alignments                       70-90
PAM160     Detecting members of a protein family                  50-60
PAM250     Longer alignments of more divergent sequences          ~30
BLOSUM90
BLOSUM80
BLOSUM62   Most effective in finding all potential similarities   30-40
BLOSUM30                                                          <30
The Similarity column gives the range of similarities that the matrix is best able to detect.

5 Dynamic programming (DP)
Recursive approach, sequential dependency. The 4th piece can be solved using the solution of the 3rd piece, the 3rd piece using the solution of the 2nd piece, and so on.

6 Sequence A ... Sequence B
New best alignment = previous best + local best.
If you already have the optimal solution to:
X…Y
A…B
then you know the next pair of characters will be one of:
X…YZ      X…Y-      X…YZ
A…BC      A…BC      A…B-
You can extend the match by determining which of these has the highest score.
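The extension rule above can be sketched as a small dynamic-programming routine. This is a minimal sketch, not the lecture's own code: the function name and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration. Each cell takes the best of the three possible extensions of a previously solved sub-alignment.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (illustrative scoring values)."""
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            # Extension 1: align a[i-1] with b[j-1] (match or mismatch).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Extension 2: align a[i-1] with a gap.
            up = prev[j] + gap
            # Extension 3: align b[j-1] with a gap.
            left = curr[j - 1] + gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATACA"))
```

Only the previous row is kept, because each cell depends only on its three neighbours; recovering the alignment itself would require storing the full table and tracing back.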

7 What is it? How is it created?

8 Window size? Stringency? Color mapping? Frame shifts?
A larger window size is used for DNA sequences because the number of random matches is much greater with only four characters in the alphabet. A typical window size for DNA is 15, with stringency 10. For proteins the matrix need not be filtered at all, or a window of 2 or 3 with stringency 2 can be used.
Frame shifts: see panels g) and h) (mutation, insertion, deletion).
Panels a-f) are self-similarity dot plots (T1 = T2). Panels g-h) are dot plots comparing two different sequences of similar length.
a) A continuous main diagonal shows perfect similarity for symbols with the same indices.
b) Parallels to the main diagonal indicate repeated regions in the same reading direction on different parts of the sequences. In this case a region D is found twice in the sequence (D1, D2, so-called ‘duplications’).
c) Lines perpendicular to the main diagonal indicate palindromic areas. In this case the sequence is completely palindromic in the displayed area. For example, consider the Latin sentence ‘SATOR AREPO TENET OPERA ROTAS’.
d) Partially palindromic sequence. For DNA sequences this refers to a perfect match of the normal strand with its reverse complement, which is frequently found in transposable elements.
e) Bold blocks on the main diagonal indicate repetition of the same symbol in both sequences, e.g. (G)50, so-called microsatellite repeats.
f) Parallel lines indicate tandem repeats of a larger motif in both sequences, e.g. (AGCTCTGAC)20, so-called minisatellite patterns. The distance between the diagonals equals the length of the motif.
g) A discontinuous diagonal indicates that the sequences T1 and T2 share a common source. In text analysis this may point to plagiarism; in DNA analysis the sequences may be homologous because of a common ancestor. The number of interruptions increases with modifications to the text, or with the time of independent evolution and the mutation rate.
h) Partial deletion in sequence 1 or insertion in sequence 2, a so-called ‘indel’. In protein-coding sequences this is often observed for many different types of domains that were lost or substituted during evolution (Beaussart et al. 2007). Comparing mRNA (cDNA) sequences without introns (T1) against the unspliced DNA sequence (T2) generally yields this picture as well.
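The window and stringency filter can be illustrated with a short sketch. The function name and the text rendering are assumptions made for this example: a dot is placed at (i, j) when at least `stringency` of the `window` characters starting at position i in T1 and position j in T2 are identical.

```python
def dot_plot(t1, t2, window=1, stringency=1):
    """Return a matrix of booleans; True marks a window pair that passes the filter."""
    rows = max(len(t1) - window + 1, 0)
    cols = max(len(t2) - window + 1, 0)
    plot = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Count identical characters inside the two windows.
            matches = sum(1 for k in range(window) if t1[i + k] == t2[j + k])
            row.append(matches >= stringency)
        plot.append(row)
    return plot


def render(plot):
    """Draw the plot as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in plot)


# Self-similarity plot of a short tandem repeat: parallel diagonals appear.
print(render(dot_plot("ACGACGACG", "ACGACGACG", window=3, stringency=3)))
```

Raising `window` while keeping `stringency` close to it removes isolated random dots, which is why DNA plots use larger values than protein plots.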

9 New stuff

10 Homology vs. similarity again
A reminder of an important concept in sequence analysis: homology. Homology is a conclusion about a common ancestral relationship, drawn from sequence similarity. Sequence similarity is a direct observation from the sequence alignment. Similarity can be quantified as a percentage, but homology cannot: two sequences either are homologous or they are not. It is important to understand this difference between homology and similarity. If the similarity is high enough, a common evolutionary relationship can be inferred.
Sequence similarity is the percentage of aligned residues that are similar in physicochemical properties such as size, charge, and hydrophobicity.
Sequence identity is a special case of similarity in which only exact matches are counted.
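The difference between identity and similarity can be made concrete with a short sketch. The residue groups below are a hypothetical, simplified grouping chosen only for illustration, and the percentages are computed over aligned columns without gaps, which is one of several conventions in use.

```python
# Hypothetical physicochemical groups, for illustration only.
SIMILAR_GROUPS = [set("AVLIM"), set("FWY"), set("KRH"), set("DE"), set("STNQ")]


def identity_and_similarity(aligned_a, aligned_b, gap="-"):
    """Return (% identity, % similarity) over the ungapped columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = identical = similar = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gapped columns are skipped under this convention
        columns += 1
        if x == y:
            identical += 1
            similar += 1  # an exact match also counts as similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    if columns == 0:
        return 0.0, 0.0
    return 100.0 * identical / columns, 100.0 * similar / columns


print(identity_and_similarity("KLV-D", "RLIAE"))
```

In the example only one of the four ungapped columns is an exact match, but all four pair residues from the same group, so similarity is much higher than identity.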

11 Limits of the alignment detection
However, what is enough? What are the detection limits of pairwise alignments? How many mutations can occur before the differences make two sequences unrecognizable? Intuitively, at some point two homologous sequences become too divergent for their alignment to be recognized as significant. The best way to determine the detection limits of pairwise alignment is to use statistical hypothesis testing. See later.

12 Twilight zone
However, the level at which one can infer a homologous relationship depends on the type of sequence (protein or nucleic acid) and on the length of the alignment. Unrelated DNA sequences have at least a 25% chance of being identical at any aligned position. For proteins it is 5%. If gaps are allowed, this percentage can increase to 10-20%. The shorter the sequence, the higher the chance that an alignment can be attributed to random chance. This suggests that shorter sequences require a higher cutoff for inferring homology than longer sequences.
For example, if two protein sequences are aligned over their full length of 100 residues, an identity of 30% or higher can safely be regarded as indicating close homology. Such pairs are sometimes referred to as being in the “safe zone”. If the identity level falls between 20% and 30%, determination of homologous relationships becomes less certain. This range is often regarded as the “twilight zone”, where remote homologs mix with randomly related sequences. Below 20% identity, where high proportions of unrelated sequences are present, homologous relationships cannot be reliably determined; this is the “midnight zone”.
It needs to be stressed that percentage identity values provide only tentative guidance for homology identification. This is not a precise rule for determining sequence relationships, especially for sequences in the twilight zone. A statistically more rigorous approach to determining homologous relationships exists.

13 The three zones of protein sequence alignments
Two protein sequences can be regarded as homologous if the percentage sequence identity falls in the safe zone. Sequence identity values below the zone boundary, but above 20%, are considered to be in the twilight zone, where homologous relationships are less certain. The region below 20% is the midnight zone, where homologous relationships cannot be reliably determined. This is not a precise rule for determining sequence relationships, especially for sequences in the twilight zone; a statistically more rigorous approach to determining homologous relationships exists.
Source: Xiong, Essential Bioinformatics.

14 Statistical significance
Key question: does a given alignment constitute evidence for homology, or did it occur just by chance?
The statistical significance of the alignment (i.e. of its score) can be tested by statistical hypothesis testing.
A good introduction to the statistical significance of alignments is in Pevsner, Chapter 3, page 89 and further.
A very good review paper is: Pagni M, Jongeneel CV. Making sense of score statistics for sequence alignments. Brief Bioinform Mar;2(1): PubMed PMID:

15 Significance of global alignment
We align two proteins, human beta globin and myoglobin, and obtain a score S. We want to know whether such a score is significant or whether it appeared just by chance. How to proceed?
State H0: the two sequences are not related; score S represents a chance occurrence.
State Ha.
Choose a significance level 𝛼.
Compute the statistics of the score distribution, i.e. the sample mean and sample standard deviation.
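One common way to obtain such a score distribution, assumed here because the slide does not spell out the procedure, is to shuffle one of the sequences many times, realign, and compare the observed score with the mean and standard deviation of the shuffled scores. The sketch below uses a toy scoring scheme and a toy peptide rather than real globin sequences; all names and parameter values are illustrative.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score with a linear gap penalty (illustrative values)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_z_score(a, b, n_shuffles=200, seed=0):
    """Return (observed score, null mean, null standard deviation, Z-score)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = global_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, randomised order
        null_scores.append(global_score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    sd = statistics.stdev(null_scores)
    if sd == 0:
        return observed, mean, sd, 0.0
    return observed, mean, sd, (observed - mean) / sd


print(shuffle_z_score("HEAGAWGHEEPAWHEAE", "HEAGAWGHEEPAWHEAE"))
```

A large positive Z-score means the observed score lies far above what shuffled sequences of the same composition produce; turning Z into a p-value requires an assumption about the shape of the null distribution, which this sketch does not make.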

16 Database similarity searching

17 BLAST
Basic Local Alignment Search Tool (BLAST): the Google of the sequence world. It compares a protein or DNA sequence to other sequences in various databases and is the main tool of NCBI.
Why search a database?
Determine what orthologs and paralogs are known for a particular sequence.
Determine what proteins or genes are present in a particular organism.
Determine the identity of a DNA or protein sequence.
Determine what variants have been described for a particular gene or protein.
Investigate ESTs.
Explore amino acid residues that are important in the function and/or structure of a protein (multiple alignment of BLAST results, conserved residues).

18 Database searching requirements I
Given a query sequence, perform pairwise alignments between the query and the whole database (target). Typically, this means that millions of alignments are analyzed in a BLAST search, and only the most closely related matches are returned. We are usually more interested in identifying locally matching regions such as protein domains, so global alignment (Needleman-Wunsch) is not often used. Smith-Waterman is too computationally intensive. Instead, a heuristic is used, which gives a significant speed-up.

19 Database searching requirements II
sensitivity: the ability to find as many correct hits (TP) as possible
selectivity (specificity): the ability to exclude incorrect hits (FP)
speed
Ideally: high sensitivity, high specificity, high speed.
In reality: an increase in sensitivity leads to a decrease in specificity, and an improvement in speed often comes at the cost of lowered sensitivity and selectivity.

20 Types of algorithms
Exhaustive: uses a rigorous algorithm to find the exact solution for a particular problem by examining all mathematical combinations. Example: DP.
Heuristic: a computational strategy to find an empirical or near-optimal solution by using rules of thumb. Algorithms of this type take shortcuts by reducing the search space according to some criteria. The shortcut strategy is not guaranteed to find the best or most accurate solution.

21 Heuristic algorithms
Heuristic algorithms perform faster searches because they examine only a fraction of the possible alignments examined in regular dynamic programming. Currently, there are two major algorithms: FASTA and BLAST. They are not guaranteed to find the optimal alignment or true homologs, but are 50–100 times faster than DP. The increased computational speed comes at a moderate expense of sensitivity and specificity of the search, which is easily tolerated by working molecular biologists.

22 BLAST
Parts of the algorithm: list, scan, extend.
BLAST uses the word method for pairwise alignment: find short stretches of identical (or nearly identical) letters in two sequences, called words (similar to the window in a dot plot).
Basic assumption: two related sequences must have at least one word in common.
By first identifying word matches, a longer alignment can be obtained by extending similarity regions from the words. Once regions of high sequence similarity are found, adjacent high-scoring regions can be joined into a full alignment.

23 BLAST - list
Compile a list of “words” of a fixed length w that are derived from the query sequence. For protein searches the word size is 3; for nucleic acid (NA) searches it is 11. A threshold value T is established for the score of aligned words (this holds for proteins; for NAs exact matches are used). Words scoring at or above the threshold are collected and used to identify database matches; words below the threshold are not pursued further. The threshold score T can be lowered to identify more initial pairwise alignments. This will increase the time required to perform the search and may increase the sensitivity.
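The list step can be sketched as follows. This is an illustrative sketch, not BLAST's implementation: `toy_score` is a made-up scoring function standing in for a real substitution matrix such as BLOSUM62, and the four-letter alphabet is used only to keep the example small.

```python
from itertools import product

HYDROPHOBIC = set("ILV")  # hypothetical group used by the toy scoring function


def toy_score(x, y):
    """Made-up substitution score: identical 5, both hydrophobic 1, otherwise -3."""
    if x == y:
        return 5
    if x in HYDROPHOBIC and y in HYDROPHOBIC:
        return 1
    return -3


def query_words(query, w):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, alphabet, score, threshold):
    """All words over the alphabet whose score against `word` is at or above T."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(x, y) for x, y in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


print(query_words("PQGEF", 3))
print(neighborhood("LIV", "ILVK", toy_score, threshold=11))
```

Lowering `threshold` makes the neighborhood larger, which mirrors the trade-off described above: more initial word hits, a slower search, and potentially higher sensitivity.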

24

25 BLAST - scan
After compiling a list of word pairs at or above threshold T, the BLAST algorithm scans a database for hits. This requires BLAST to search an index of the database to find entries that correspond to words on the compiled list.

26 BLAST - extend
Extend hits to find alignments called high-scoring segment pairs (HSPs). Extend in both directions (originally ungapped; gapped BLAST is newer) and compute the alignment score. The extension process is terminated when the score falls below a cutoff.
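A minimal sketch of ungapped extension is given below. The stopping rule is an assumption made for illustration: extension in each direction stops once the running score has dropped more than `x_drop` below the best score seen so far, and the segment is trimmed back to that best point. Function names, the scoring lambda, and the parameter values are hypothetical.

```python
def extend_one_direction(q, s, qi, si, step, score, x_drop):
    """Extend from (qi, si) in direction `step`; return (best gain, residues kept)."""
    total = best = best_len = length = 0
    while 0 <= qi < len(q) and 0 <= si < len(s):
        total += score(q[qi], s[si])
        length += 1
        if total > best:
            best, best_len = total, length
        elif best - total > x_drop:
            break  # score fell too far below the best seen so far
        qi += step
        si += step
    return best, best_len


def extend_hit(q, s, qi, si, w, score, x_drop=5):
    """Extend a word hit of length w at q[qi:], s[si:] into an ungapped HSP.

    Returns (HSP score, query start, query end), with the end exclusive.
    """
    seed = sum(score(q[qi + k], s[si + k]) for k in range(w))
    right, right_len = extend_one_direction(q, s, qi + w, si + w, +1, score, x_drop)
    left, left_len = extend_one_direction(q, s, qi - 1, si - 1, -1, score, x_drop)
    return seed + left + right, qi - left_len, qi + w + right_len


simple = lambda x, y: 1 if x == y else -3  # toy match/mismatch scoring
query, subject = "XXACGTACGTYY", "ZZACGTACGTWW"
hsp_score, start, end = extend_hit(query, subject, 4, 4, 3, simple)
print(hsp_score, query[start:end])
```

Because there are no gaps, the subject coordinates of the HSP are the query coordinates shifted by the constant offset `si - qi`.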

27 BLAST strategy
Compare a protein or DNA query sequence to each database entry and form pairwise alignments (HSPs). When the threshold parameter is raised, the speed of the search is increased, but fewer hits are registered, and so distantly related database matches may be missed. When the threshold parameter is lowered, the search proceeds more slowly, but many more word hits are evaluated, and thus sensitivity is increased.

28 Recent improvement: gapped BLAST
Variants:
BLASTN: nucleotide sequences.
BLASTP: protein sequences.
BLASTX: uses nucleotide sequences as queries and translates them in all six reading frames to produce translated protein sequences, which are used to query a protein sequence database.
TBLASTN: queries protein sequences against a nucleotide sequence database with the sequences translated in all six reading frames.
TBLASTX: uses nucleotide sequences, which are translated in all six frames, to search against a nucleotide sequence database that has all the sequences translated in six frames.
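The six-frame translation used by the translated variants can be sketched as below. This is an illustrative sketch using the standard genetic code; the function names are assumptions, and incomplete trailing codons are simply dropped.

```python
BASES = "TCAG"
# Standard genetic code, with codons ordered by first, second, third base in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translations in frames +1, +2, +3, -1, -2, -3 (in that order)."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


print(six_frames("ATGGCCTAA"))
```

Each translated frame would then be compared at the protein level, which is why the translated searches cost several times more than a single protein or nucleotide comparison.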

29 Which sequence to search?
The choice of sequence type also influences the sensitivity of the search. There is a clear advantage in using protein sequences for detecting homologs.
If the input sequence is a protein-encoding DNA sequence, use BLASTX, which translates the query in all six reading frames before sequence comparison.
If you are looking for protein homologs encoded in newly sequenced genomes, you may use TBLASTN. This may help to identify protein-coding genes that have not yet been annotated.
If a DNA sequence is to be used as the query, a protein-level comparison can be done with TBLASTX.
TBLASTN and TBLASTX are very computationally intensive, and the search process can be very slow.

30 E-value I
Expected value: a parameter that describes the number of hits one can 'expect' to see by chance when searching a database of a particular size.
It decreases exponentially as the score of the match increases.
An E-value of 1 assigned to a hit means that, in a database of the current size, one might expect to see 1 match with a similar score simply by chance.
If the database were twice as big, the expected number of chance matches with a score equal to or greater than S would double.

31 E-value II
E < … extremely high confidence that the database match is a result of homologous relationships
E is from (10^-50, 0.01) … the match can be considered a result of homology (for proteins, E-values < 0.001 are conclusive)
E is from (0.01, 10) … the match is considered not significant, but may hint at tentative remote homology
E > 10 … the sequences under consideration are either unrelated or related by extremely distant relationships that fall below the limit of detection with the current method.
The E-value is proportional to the database size: as the database grows, the E-value for a given sequence match increases. However, the evolutionary relationship between two sequences remains constant. As the database grows, one may lose previously detected homologs.
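Both properties, the exponential decrease with score and the proportionality to database size, follow from the Karlin-Altschul formula E = K·m·n·e^(−λS), where m is the query length, n the database length, and K and λ depend on the scoring system. The sketch below evaluates it; the values of K and λ are placeholders chosen for illustration, not parameters fitted to any real matrix.

```python
import math


def e_value(raw_score, query_len, db_len, K, lam):
    """Expected number of chance hits scoring at least raw_score: K*m*n*exp(-lam*S)."""
    return K * query_len * db_len * math.exp(-lam * raw_score)


K, LAM = 0.1, 0.3  # placeholder values, for illustration only
m, n = 300, 1_000_000

for score in (40, 60, 80):
    print(score, e_value(score, m, n, K, LAM))

# Doubling the database size doubles the E-value for the same score.
print(e_value(60, m, 2 * n, K, LAM) / e_value(60, m, n, K, LAM))
```

The last line shows why a match that was significant in a small database can drift above a fixed E-value cutoff as the database grows.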

32 Bit score
A typical BLAST output reports both E-values and scores. There are two kinds of scores: raw and bit scores. Raw scores are calculated from the substitution matrix and the gap penalty parameters that are chosen. The bit score S’ is calculated from the raw score by normalizing with the statistical variables that define a given scoring system. Bit scores from different alignments, even those employing different scoring matrices in separate BLAST searches, can be compared.
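The normalization can be written as S’ = (λS − ln K) / ln 2, after which the E-value depends only on the bit score and the search space: E = m·n·2^(−S’). The sketch below checks that both routes give the same E-value; the K and λ values are placeholders for illustration, not parameters of a real scoring system.

```python
import math


def bit_score(raw_score, K, lam):
    """Normalize a raw score with the scoring system's K and lambda."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, query_len, db_len):
    """E-value from a bit score: m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


K, LAM = 0.1, 0.3  # placeholder values, for illustration only
m, n, raw = 300, 1_000_000, 60

bits = bit_score(raw, K, LAM)
print(bits, e_value_from_bits(bits, m, n))
print(K * m * n * math.exp(-LAM * raw))  # same E-value computed from the raw score
```

Because K and λ are absorbed into S’, two hits with equal bit scores have the same E-value in the same search space, whichever matrix produced them.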

