Introduction to bioinformatics


1 Introduction to bioinformatics
Sequence Alignment Part 3

2 What's today? More BLAST
Similarity scores for protein sequences
Gaps
Statistical significance (E-value)

3 Protein Sequence Alignment
Rule of thumb:
Proteins are likely homologous if they are at least 25% identical (alignment length > 100).
DNA sequences are likely homologous if they are at least 70% identical.
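The rule of thumb can be checked mechanically once two sequences are aligned. The sketch below is a minimal illustration, assuming that identity is counted over columns where neither sequence has a gap (other definitions exist); the function names are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue (no '-').
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def passes_rule_of_thumb(a: str, b: str, is_protein: bool = True) -> bool:
    """Apply the slide's thresholds: 25% (proteins, length > 100) or 70% (DNA)."""
    aligned_len = sum(x != "-" and y != "-" for x, y in zip(a, b))
    if is_protein:
        return aligned_len > 100 and percent_identity(a, b) >= 25.0
    return percent_identity(a, b) >= 70.0


print(percent_identity("ACGT", "ACGA"))
```

A short protein alignment fails the check even at high identity, because the length condition is part of the rule.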

4 Protein Pairwise Sequence Alignment
The alignment tools are similar to the DNA alignment tools:
BLASTN for nucleotides
BLASTP for proteins
Main difference: instead of scoring match (+2) and mismatch (-1), we have similarity scores:
Score s(i,j) > 0 if amino acids i and j have similar properties
Score s(i,j) ≤ 0 otherwise
How should we score s(i,j)?

5 The 20 Amino Acids

6 Chemical Similarities Between Amino Acids
Acids & amides: DENQ (Asp, Glu, Asn, Gln)
Basic: HKR (His, Lys, Arg)
Aromatic: FYW (Phe, Tyr, Trp)
Hydrophilic: ACGPST (Ala, Cys, Gly, Pro, Ser, Thr)
Hydrophobic: ILMV (Ile, Leu, Met, Val)

7 Sequence Alignment based on AA similarity
TQSPSSLSASVGDTVTITCRASQSISTYLNWYQQKP----GKAPKLLIYAASSSQSGVPS
|| |||| +|| ||| | +| | | | |
TQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADS

RFSGSGSGTDFTLTINSLQPEDFATYYCQ QSYSTPHFSQGTKLEI
| | | +| | | +|+ || || | | | || | +
RRSLWDQG-NFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTL

---KRTVAAPSVFIFPPSDEQLKSGTASVVCLLN NFYPREAKVQWKVD
++||| | | | | ||++|+|
TLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKID

| = identity, + = similarity
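The | and + marks can be generated from two aligned sequences. The sketch below is a simplification: it treats two residues as "similar" when they fall in the same chemical group from the previous slide, whereas BLAST marks + where the substitution-matrix score is positive. The names GROUPS and match_line are hypothetical.

```python
# Chemical groups as listed on the "Chemical Similarities" slide.
GROUPS = ["DENQ", "HKR", "FYW", "ACGPST", "ILMV"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}


def match_line(a: str, b: str) -> str:
    """Return the middle line of an alignment: '|' identity, '+' same group."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            marks.append(" ")          # gap column: no mark
        elif x == y:
            marks.append("|")          # identical residues
        elif x in GROUP_OF and GROUP_OF[x] == GROUP_OF.get(y):
            marks.append("+")          # different residues, same chemical group
        else:
            marks.append(" ")
    return "".join(marks)


print(match_line("TQSD", "TQGE"))  # prints ||++
```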

8 Amino Acid Substitutions Matrices
When scoring protein sequence alignments, it is common to use a 20 × 20 matrix representing all pairwise comparisons: the substitution matrix.

9 Given an alignment of closely related sequences
we can score the relation between amino acids based on how frequently they substitute for each other:
M G Y D E
M G Y E E
M G Y Q E
In this column E & D are found 7/8
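Counting substitutions column by column can be sketched as follows. This is an illustrative log-odds calculation under simple assumptions (every pair of rows in a column is counted once, and expected frequencies come from the residue composition of the counted pairs); it is not the exact PAM or BLOSUM procedure, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_counts(alignment):
    """Count unordered residue pairs seen in the same alignment column."""
    counts = Counter()
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" or y == "-":
                continue  # ignore gap positions
            counts[tuple(sorted((x, y)))] += 1
    return counts


def log_odds(counts):
    """Score each observed pair as log2(observed / expected-by-chance)."""
    total = sum(counts.values())
    residue = Counter()
    for (x, y), c in counts.items():
        residue[x] += c
        residue[y] += c
    scores = {}
    for (x, y), c in counts.items():
        observed = c / total
        px = residue[x] / (2 * total)
        py = residue[y] / (2 * total)
        expected = px * py if x == y else 2 * px * py
        scores[(x, y)] = log2(observed / expected)
    return scores


rows = ["MGYDE", "MGYEE", "MGYQE"]  # the three rows shown on the slide
counts = pair_counts(rows)
scores = log_odds(counts)
print(counts[("D", "E")], counts[("E", "E")])
```

With only three rows the scores are not meaningful; real matrices are built from many thousands of observed pairs.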

10 Amino Acid Matrices
Symmetric matrix of 20 × 20 entries: entry (i,j) = entry (j,i).
Entry (i,i) is greater than any entry (i,j), j ≠ i.
Entry (i,j): the score of aligning amino acid i against amino acid j.

11 PAM - Point Accepted Mutations
Developed by Margaret Dayhoff, 1978.
Analyzed very similar protein sequences:
Proteins are evolutionarily close.
Alignment is easy.
Point mutations: mainly substitutions.
Accepted mutations: accepted by natural selection.
Used global alignment.
Counted the number of substitutions (i,j) per amino acid pair: many i<->j substitutions => high score s(i,j).
Found that common substitutions involved chemically similar amino acids.

12 PAM 250
Similar amino acids are close to each other.
Regions define conserved substitutions.

13 Example: Asp & Glu
Score = 3
[Figure: structural formulas of aspartate (Asp, D) and glutamate (Glu, E)]

14 Selecting a PAM Matrix
Low PAM numbers: short sequences, strong local similarities.
High PAM numbers: long sequences, weak similarities.
PAM120 recommended for general use (40% identity)
PAM60 for close relations (60% identity)
PAM250 for distant relations (20% identity)
If uncertain, try several different matrices; PAM40, PAM120, and PAM250 are recommended.

15 BLOSUM Blocks Substitution Matrix
Steven Henikoff and Jorja G. Henikoff (1992)
Based on the BLOCKS database:
Families of proteins with identical function
Highly conserved protein domains
Ungapped local alignment to identify motifs
Each motif is a block of local alignment
Counts amino acids observed in the same column
Symmetrical model of substitution
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC

16 BLOSUM Matrices
Different BLOSUMn matrices are calculated independently from BLOCKS.
BLOSUMn is based on sequences that are at most n percent identical.

17 Selecting a BLOSUM Matrix
For BLOSUMn, higher n is suitable for sequences that are more similar.
BLOSUM62 recommended for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations

18 Summary
BLOSUM matrices are based on the replacement patterns found in the more highly conserved regions of the sequences, without gaps.
PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions.

19 Gap Scores
Example showed -1 score per indel, so gap cost is proportional to its length.
Biologically, indels occur in groups; we want our gap score to reflect this.
Standard solution: affine gap model
One-off cost for opening a gap
Lower cost for extending the gap
Changes required to the algorithm
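The difference between the two gap models can be sketched in a few lines. The convention assumed here is that a gap of length k costs gap_open + k × gap_extend; some tools instead charge gap_open + (k - 1) × gap_extend. The function names and the parameter values in the example are illustrative.

```python
def linear_gap_cost(length: int, per_residue: int = 1) -> int:
    """Linear model: cost is proportional to gap length."""
    return per_residue * length


def affine_gap_cost(length: int, gap_open: int, gap_extend: int) -> int:
    """Affine model: pay gap_open once, then gap_extend for every gapped residue."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


# One gap of four residues versus four separate gaps of one residue each.
one_long_gap = affine_gap_cost(4, gap_open=9, gap_extend=1)
four_short_gaps = 4 * affine_gap_cost(1, gap_open=9, gap_extend=1)
print(one_long_gap, four_short_gaps)
```

Under the linear model both arrangements cost the same; under the affine model the single long gap is much cheaper, which reflects indels occurring in groups.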

20 Scoring system = Substitution Matrix + Gap Penalty
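As a sketch of how the two parts combine, the code below scores an existing gapped alignment by adding substitution scores and subtracting affine gap penalties. TOY_MATRIX is a made-up fragment for illustration only (it is not PAM or BLOSUM), and the gap convention assumed is gap_open + k × gap_extend for a gap of length k.

```python
# Hypothetical scores for a handful of residue pairs; not a real matrix.
TOY_MATRIX = {
    ("D", "D"): 4, ("E", "E"): 4, ("K", "K"): 5,
    ("D", "E"): 3, ("D", "K"): 0, ("E", "K"): 0,
}


def lookup(matrix, x, y):
    """Substitution matrices are symmetric, so try both orders."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]


def alignment_score(a, b, matrix, gap_open, gap_extend):
    """Sum of substitution scores minus affine gap penalties."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap_a = in_gap_b = False  # are we currently inside a gap in a / in b?
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("column with gaps in both sequences")
        if x == "-":
            score -= gap_extend + (0 if in_gap_a else gap_open)
            in_gap_a, in_gap_b = True, False
        elif y == "-":
            score -= gap_extend + (0 if in_gap_b else gap_open)
            in_gap_a, in_gap_b = False, True
        else:
            score += lookup(matrix, x, y)
            in_gap_a = in_gap_b = False
    return score


print(alignment_score("DEEK", "D--K", TOY_MATRIX, gap_open=9, gap_extend=1))
```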

21 Gap penalty
We expect to penalize gaps.
Scoring for gap opening and for extension:
Insertions and deletions are rare in evolution,
but once they are created, they are easy to extend.
Gap-extension penalty < gap-open penalty
Default gap parameters are given for each matrix:
PAM30: open=9, extension=1
PAM250: open=14, extension=2

22 Low Complexity Sequences
AAAAAAAAAAA
ATATATATATATA
CAGCAGCAGCAG
Low-complexity sequences can produce significant hits that are not true homologues.
How does BLAST deal with low-complexity sequences?
By default, low-complexity sequences are filtered out and replaced by XXXXX.
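The idea of masking can be illustrated with a simple composition filter. This is only a sketch: BLAST itself uses dedicated filters (SEG for proteins, DUST for nucleotides), while the code below masks any window whose Shannon entropy falls under a threshold. The window size and threshold are arbitrary assumptions chosen so that the three example sequences on the slide are masked.

```python
from collections import Counter
from math import log2


def shannon_entropy(s: str) -> float:
    """Entropy (in bits) of the letter composition of s."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 1.6,
                        mask_char: str = "X") -> str:
    """Replace every residue covered by a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(0, len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("MKVLAGTQSPDE" + "A" * 12))
```

A composition-only filter misses repeats built from many different letters, which is one reason real filters are more elaborate.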

23 Statistical significance

24 E-value
The number of hits (with the same similarity score) one can "expect" to see just by chance when searching the given string in a database of a particular size.
The higher the E-value, the lower the similarity.
“sequences with E-value of less than 0.01 are almost always found to be homologous”
The lower bound is normally 0 (we want to find the best).

25 Expectation Values
Increases linearly with length of query sequence
Increases linearly with length of database
Decreases exponentially with score of alignment
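These three dependencies match the Karlin-Altschul form E = K · m · n · e^(-λS), where m is the query length, n the database length, and S the raw alignment score. The sketch below assumes that form; the default values of lam and k are illustrative placeholders, since the real values depend on the scoring matrix and gap penalties in use.

```python
from math import exp


def expect_value(raw_score: float, query_len: float, db_len: float,
                 lam: float = 0.267, k: float = 0.041) -> float:
    """Expected number of chance hits scoring at least raw_score.

    lam and k are statistical parameters of the scoring system;
    the defaults here are placeholders for illustration.
    """
    return k * query_len * db_len * exp(-lam * raw_score)


base = expect_value(50, 100, 1_000_000)
print(expect_value(50, 200, 1_000_000) / base)  # doubling the query length
print(expect_value(50, 100, 2_000_000) / base)  # doubling the database length
print(expect_value(60, 100, 1_000_000) / base)  # raising the score by 10
```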

26 Bit score and E value
Bit score (S): similar to the alignment score, but normalized. Higher means more significant.
E value: number of hits of score ≥ S expected by chance, based on a random database of similar size. Lower means more significant.
Used to assess the statistical significance of the alignment.
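The normalization can be sketched with the usual conversion S' = (λS - ln K) / ln 2, after which the expected number of chance hits is m · n · 2^(-S'). The code assumes this standard form; lam and k are parameters of the scoring system and must be supplied by the caller, and the function names are hypothetical.

```python
from math import log


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score to a normalized bit score."""
    return (lam * raw_score - log(k)) / log(2)


def expect_from_bits(bits: float, query_len: float, db_len: float) -> float:
    """Expected number of chance hits for a given bit score."""
    return query_len * db_len * 2.0 ** (-bits)


# Each additional bit halves the expected number of chance hits.
print(expect_from_bits(10, 100, 1000), expect_from_bits(11, 100, 1000))
```

Because the scoring-system parameters are folded into the bit score, bit scores from searches with different matrices can be compared directly.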

27 Remote homologues: PSI-BLAST
Sometimes BLAST isn’t enough.
Large protein family, and BLAST only gives close members.
We want more distant members: PSI-BLAST.

28 PSI-BLAST: Position-Specific Iterated BLAST
Regular BLAST
Construct profile from BLAST results
BLAST profile search
Final results
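The loop of search, profile building, and profile search can be sketched on toy data. This is a heavily simplified illustration, not the PSI-BLAST implementation: it assumes ungapped sequences of equal length, a uniform background, a fixed pseudocount, and a raw score cutoff instead of an E-value threshold. All names are hypothetical.

```python
from collections import Counter
from math import log2

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_hits, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background."""
    length = len(aligned_hits[0])
    if any(len(s) != length for s in aligned_hits):
        raise ValueError("hits must have equal length")
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*aligned_hits):
        counts = Counter(c for c in column if c in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({aa: log2(((counts[aa] + pseudocount) / total) / background)
                        for aa in ALPHABET})
    return profile


def profile_score(profile, seq):
    """Sum the position-specific scores of seq against the profile."""
    return sum(column[aa] for column, aa in zip(profile, seq))


def iterate_search(query, database, threshold, rounds=3):
    """Rebuild the profile from the current hits and search again."""
    hits = [query]
    for _ in range(rounds):
        profile = build_profile(hits)
        new_hits = [query] + [s for s in database
                              if len(s) == len(query)
                              and profile_score(profile, s) >= threshold]
        if set(new_hits) == set(hits):
            break  # converged: no change in the hit set
        hits = new_hits
    return hits


db = ["MGYEE", "MGFEE", "WWWWW"]
print(iterate_search("MGYDE", db, threshold=3.0, rounds=1))
print(iterate_search("MGYDE", db, threshold=3.0, rounds=3))
```

In this toy run the more distant sequence is picked up only after the profile has learned from the first hit; the same mechanism lets one wrong hit pull in unrelated sequences in later rounds.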

29 PSI-BLAST
Advantage: PSI-BLAST looks for sequences that are close to ours, and learns from them to extend the circle of friends.
Disadvantage: if we find a WRONG sequence, we will get to unrelated sequences. This gets worse with each iteration.

