Presentation on theme: "Introduction to bioinformatics"— Presentation transcript:
1 Introduction to bioinformatics: Sequence Alignment, Part 3
2 WHAT'S TODAY? MORE BLAST: Similarity scores for protein sequences. Gaps. Statistical significance (E-value).
3 Protein Sequence Alignment. Rule of thumb: proteins are likely homologous if they are at least 25% identical (over a length > 100); DNA sequences are likely homologous if they are at least 70% identical.
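To make the rule of thumb concrete, here is a minimal sketch of how percent identity could be computed from two already-aligned rows. The function name `percent_identity`, the `-` gap symbol, and the choice to divide by the number of gap-free columns are assumptions for illustration; other tools use other denominators.

```python
def percent_identity(row1, row2):
    """Percent of gap-free aligned columns where both residues are identical.

    Assumes both rows come from the same alignment (equal length) and
    that '-' marks a gap. The denominator (gap-free columns) is one of
    several conventions in use.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    columns = [(a, b) for a, b in zip(row1, row2) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


print(percent_identity("MGYDE", "MGYEE"))  # 4 of 5 columns identical
```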
4 Protein Pairwise Sequence Alignment. The alignment tools are similar to the DNA alignment tools: BLASTN for nucleotides, BLASTP for proteins. Main difference: instead of scoring match (+2) and mismatch (-1), we have similarity scores. Score s(i,j) > 0 if amino acids i and j have similar properties. Score s(i,j) is ≤ 0 otherwise. How should we score s(i,j)?
8 Amino Acid Substitution Matrices. When scoring protein sequence alignments it is common to use a 20 × 20 matrix representing all pairwise comparisons: the substitution matrix.
9 Given an alignment of closely related sequences, we can score the relation between amino acids based on how frequently they substitute each other.
    M G Y D E
    M G Y E E
    M G Y Q E
In this column E & D are found 7/8.
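The counting idea on this slide can be sketched as follows. This is only an illustration using the three rows shown on the slide; the function name `count_column_pairs` is hypothetical, and turning the raw counts into matrix scores is a further step that is not shown here.

```python
from collections import Counter
from itertools import combinations


def count_column_pairs(alignment):
    """Count unordered amino acid pairs that share an alignment column.

    Assumes the rows are ungapped and of equal length.
    """
    counts = Counter()
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            # Sort so that (D, E) and (E, D) land in the same bin.
            counts[tuple(sorted((a, b)))] += 1
    return counts


rows = ["MGYDE", "MGYEE", "MGYQE"]  # the three rows shown on the slide
pairs = count_column_pairs(rows)
print(pairs[("D", "E")])  # D/E pairings, all from the fourth column
print(pairs[("M", "M")])  # M/M pairings from the fully conserved first column
```

Pairs that co-occur often in columns of related sequences are the ones that end up with high substitution scores.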
10 Amino Acid Matrices. Symmetric matrix of 20 × 20 entries: entry (i,j) = entry (j,i). Entry (i,i) is greater than any entry (i,j), j ≠ i. Entry (i,j): the score of aligning amino acid i against amino acid j.
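A small sketch of how such a symmetric matrix might be stored and used to score an ungapped alignment. The scores in `TOY_SCORES` are made-up values for a three-letter alphabet, not real PAM or BLOSUM entries, and the names `s` and `score_ungapped` are illustrative.

```python
# Hypothetical toy scores (NOT real PAM or BLOSUM values). Only one
# triangle is stored because the matrix is symmetric.
TOY_SCORES = {
    ("D", "D"): 4, ("E", "E"): 4, ("K", "K"): 4,
    ("D", "E"): 2,   # chemically similar pair: positive score
    ("D", "K"): 0, ("E", "K"): 0,
}


def s(i, j):
    """Symmetric lookup: s(i, j) == s(j, i)."""
    return TOY_SCORES[(i, j)] if (i, j) in TOY_SCORES else TOY_SCORES[(j, i)]


def score_ungapped(seq1, seq2):
    """Sum s(i, j) over the aligned positions of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(s(a, b) for a, b in zip(seq1, seq2))


print(score_ungapped("DEK", "EEK"))  # 2 + 4 + 4 = 10
```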
11 PAM - Point Accepted Mutations. Developed by Margaret Dayhoff, 1978. Analyzed very similar protein sequences: the proteins are evolutionarily close, so alignment is easy. Point mutations: mainly substitutions. Accepted mutations: accepted by natural selection. Used global alignment. Counted the number of substitutions (i,j) per amino acid pair: many i<->j substitutions => high score s(i,j). Found that common substitutions involved chemically similar amino acids.
12 PAM 250 Similar amino acids are close to each other. Regions define conserved substitutions.
13 Example: Asp & Glu. Score = 3. [Figure: chemical structures of aspartate (Asp, D) and glutamate (Glu, E)]
14 Selecting a PAM Matrix. Low PAM numbers: short sequences, strong local similarities. High PAM numbers: long sequences, weak similarities. PAM120 recommended for general use (40% identity). PAM60 for close relations (60% identity). PAM250 for distant relations (20% identity). If uncertain, try several different matrices: PAM40, PAM120, PAM250 recommended.
15 BLOSUM: Blocks Substitution Matrix. Steven and Jorja G. Henikoff (1992). Based on the BLOCKS database (families of proteins with identical function). Highly conserved protein domains. Ungapped local alignment to identify motifs. Each motif is a block of local alignment. Counts amino acids observed in the same column. Symmetrical model of substitution.
    AABCDA...BBCDA
    DABCDA.A.BBCBB
    BBBCDABA.BCCAA
    AAACDAC.DCBCDB
    CCBADAB.DBBDCC
    AAACAA...BBCCC
16 BLOSUM Matrices. Different BLOSUMn matrices are calculated independently from BLOCKS. BLOSUMn is based on sequences that are at most n percent identical.
17 Selecting a BLOSUM Matrix. For BLOSUMn, higher n is suitable for sequences which are more similar. BLOSUM62 recommended for general use. BLOSUM80 for close relations. BLOSUM45 for distant relations.
18 Summary: BLOSUM matrices are based on the replacement patterns found in the more highly conserved regions of the sequences, without gaps. PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions.
19 Gap Scores. Example showed -1 score per indel, so gap cost is proportional to its length. Biologically, indels occur in groups. We want our gap score to reflect this. Standard solution: affine gap model. Once-off cost for opening a gap. Lower cost for extending the gap. Changes required to algorithm.
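The difference between the two gap models can be sketched in a few lines. The parameter values (open = 5, extend = 1) are hypothetical, and the convention that the opening cost already covers the first gapped position is an assumption; some tools instead charge open plus extend for the first position.

```python
def linear_gap_cost(length, per_position=1):
    """Linear model: cost is proportional to gap length."""
    return per_position * length


def affine_gap_cost(length, gap_open, gap_extend):
    """Affine model: a once-off opening cost, then a lower cost per extension.

    Assumed convention: gap_open pays for the first gapped position and
    gap_extend for every further one.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# One indel of length 4 versus four separate indels of length 1
# (hypothetical parameters: open = 5, extend = 1).
print(affine_gap_cost(4, 5, 1))      # 5 + 3 * 1 = 8
print(4 * affine_gap_cost(1, 5, 1))  # 4 * 5 = 20
```

Under the affine model a single long gap is much cheaper than several scattered ones, which matches the observation that indels occur in groups. Under the linear model both cases would cost the same.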
20 Scoring system = Substitution Matrix + Gap Penalty
21 Gap penalty. We expect to penalize gaps. Scoring for gap opening & for extension. Insertions and deletions are rare in evolution, but once they are created, they are easy to extend. Gap-extension penalty < gap-open penalty. Default gap parameters are given for each matrix. PAM30: open=9, extension=1. PAM250: open=14, extension=2.
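Putting the two parts of the scoring system together, the following sketch scores a complete gapped alignment. The gap values 14 and 2 are the PAM250 defaults quoted on the slide; everything else is an assumption for illustration: `toy_sub_score` is a placeholder (+5 for identical residues, 0 otherwise) and not the PAM250 matrix, `-` marks a gap, and the opening penalty is taken to cover the first gapped position.

```python
def score_alignment(row1, row2, sub_score, gap_open, gap_extend):
    """Score = substitution scores of aligned pairs minus affine gap penalties."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    prev_gap_row = None  # row holding the gap in the previous column, if any
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            gap_row = 1 if a == "-" else 2
            # Continuing a gap in the same row is cheaper than opening one.
            total -= gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            total += sub_score(a, b)
            prev_gap_row = None
    return total


def toy_sub_score(a, b):
    """Hypothetical placeholder scores, NOT real PAM250 values."""
    return 5 if a == b else 0


# Same number of gapped positions, grouped versus scattered.
print(score_alignment("MGYDE", "MG--E", toy_sub_score, 14, 2))  # 15 - 14 - 2 = -1
print(score_alignment("MGYDE", "M-Y-E", toy_sub_score, 14, 2))  # 15 - 14 - 14 = -13
```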
22 Low Complexity Sequences. Examples: AAAAAAAAAAAA, TATATATATATA, CAGCAGCAGCAG. Sequences of low complexity can produce significant hits that are not true homologues! How does BLAST deal with low complexity sequences? By default, low complexity sequences are filtered out and replaced by XXXXX.
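To illustrate what filtering and replacing by X means, here is a rough sketch that masks windows with low Shannon entropy. This is not the algorithm BLAST uses (BLAST relies on dedicated filters such as SEG for proteins and DUST for nucleotides); the window size, the entropy threshold, and the function names are all assumptions chosen so that the three examples from the slide get masked.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the letter composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, max_entropy=1.6, mask_char="X"):
    """Replace every window whose entropy is <= max_entropy with mask_char.

    Simplified illustration only; thresholds are hypothetical.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= max_entropy:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


for example in ("AAAAAAAAAAAA", "TATATATATATA", "CAGCAGCAGCAG", "MKTWYQDLPHRS"):
    print(example, "->", mask_low_complexity(example))
```

Because windows overlap, a few residues flanking a repeat may be masked as well; real filters handle the boundaries more carefully.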
24 E-value. The number of hits (with the same similarity score) one can "expect" to see just by chance when searching the given string in a database of a particular size. Higher E-value means lower similarity. “Sequences with E-value of less than 0.01 are almost always found to be homologous.” The lower bound is normally 0 (we want to find the best).
25 Expectation Values. The E-value increases linearly with the length of the query sequence, decreases exponentially with the score of the alignment, and increases linearly with the length of the database.
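These three dependencies are what the standard Karlin-Altschul expression E = K · m · n · e^(-λS) encodes, where m is the query length, n the database length, and S the raw alignment score. The slide does not state the formula, so the sketch below is an assumption about what is meant, and the values of K and λ are arbitrary placeholders (in practice they depend on the scoring system).

```python
import math


def expected_hits(score, query_len, db_len, K=0.1, lam=0.3):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S).

    K and lam are placeholder values for illustration only.
    """
    return K * query_len * db_len * math.exp(-lam * score)


base = expected_hits(score=50, query_len=250, db_len=1_000_000)
print(expected_hits(50, 500, 1_000_000) / base)  # doubling the query doubles E
print(expected_hits(50, 250, 2_000_000) / base)  # doubling the database doubles E
print(expected_hits(60, 250, 1_000_000) / base)  # +10 score scales E by exp(-3)
```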
26 Bit score (S): similar to alignment score; normalized; higher means more significant. E value: number of hits of score ≥ S expected by chance; based on random database of similar size; lower means more significant; used to assess the statistical significance of the alignment.
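One common way to express the normalization is S' = (λS - ln K) / ln 2, after which the E value depends only on the bit score and the search space: E = m · n · 2^(-S'). The slide does not give these formulas, so this is a sketch under that assumption, with placeholder values for K and λ.

```python
import math


def bit_score(raw_score, K, lam):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def evalue_from_bits(bits, query_len, db_len):
    """E = m * n * 2 ** (-S'), independent of the scoring system used."""
    return query_len * db_len * 2.0 ** (-bits)


# Placeholder statistical parameters, for illustration only.
K, lam = 0.1, 0.3
bits = bit_score(50, K, lam)
print(bits, evalue_from_bits(bits, 250, 1_000_000))
```

Because K and λ are folded into the bit score, bit scores from searches with different substitution matrices can be compared directly, which raw scores cannot.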
27 Remote homologues. Sometimes BLAST isn’t enough: with a large protein family, BLAST only gives close members. We want more distant members: PSI-BLAST.
28 PSI-BLAST: Position Specific Iterated BLAST. Steps: (1) regular BLAST; (2) construct profile from BLAST results; (3) BLAST profile search; (4) final results.
29 PSI-BLAST. Advantage: PSI-BLAST looks for sequences that are close to ours, and learns from them to extend the circle of friends. Disadvantage: if we pick up a WRONG sequence, we will drift to unrelated sequences. This gets worse with each iteration.
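The profile step can be illustrated with a heavily simplified position-specific scoring matrix. This is not PSI-BLAST's actual procedure (which also weights sequences and derives pseudocounts differently); it assumes ungapped hits of equal length, a uniform background of 1/20, and a flat pseudocount, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_hits, pseudocount=1.0):
    """One dict of log-odds scores per alignment column.

    Simplifications: ungapped equal-length hits, uniform background
    frequency, flat pseudocount for every amino acid.
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*aligned_hits):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_with_profile(profile, seq):
    """Sum the position-specific scores of seq (same length as the profile)."""
    return sum(column[aa] for column, aa in zip(profile, seq))


hits = ["MGYDE", "MGYEE", "MGYQE"]  # stand-ins for hits from the first BLAST round
profile = build_profile(hits)
print(score_with_profile(profile, "MGYDE"))
print(score_with_profile(profile, "WWWWW"))
```

Unlike a fixed substitution matrix, the profile scores the same amino acid differently at each position, which is what lets the next search round reach more distant family members.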