


1 Before we begin…
MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
|| ||  ||||| ||| || |   ||||||||||||||||||||
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…
ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG
TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG
GAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACT
GATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCA
GAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAG
GTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACA
ACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCC
TGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGT
CATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGC
ATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTT
TCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACA
ATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTT
TCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTA
CTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAG
GGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGG
TTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAAC
AAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGT
CTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAA
GGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCC
CTGGCTCACAAGTACCATTGA

2 Pairwise Sequence Alignment Lesson 2

3 What is sequence alignment?
Alignment: comparing two (pairwise) or more (multiple) sequences, searching for a series of identical or similar characters in the sequences.
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
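The row of bars between two aligned sequences can be generated mechanically. The sketch below is only an illustration, assuming the two rows are already aligned to equal length and that a bar marks an identical (non-gap) position; the function name `match_line` is hypothetical.

```python
def match_line(row_a: str, row_b: str) -> str:
    """Return '|' under identical positions and ' ' elsewhere."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(row_a, row_b)
    )


top = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
bottom = "MVHLTPEEKTAVNALWGKVNVDAVGGE"

print(top)
print(match_line(top, bottom))
print(bottom)
```

In a monospaced font the printed bars line up with the 18 identical positions of the two fragments.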

4 Why sequence alignment? To predict characteristics of a protein: use the structure or function information available in databases for known proteins with similar sequences to predict the structure or function of an unknown protein. Assumption: similar sequences produce similar proteins.

5 Local vs. Global
• Global alignment: finds the best alignment across the whole of the two sequences.
• Local alignment: finds regions of high similarity in parts of the sequences.
Global alignment forces alignment in regions which differ:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment concentrates on regions of high similarity:
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
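The difference between the two modes shows up clearly in a small dynamic-programming sketch. The slides do not name an algorithm, so this is an assumption: the global score follows the Needleman-Wunsch recurrence and the local score follows the Smith-Waterman recurrence (scores floored at zero). The scoring values are illustrative defaults, and only scores are computed, not the alignments themselves.

```python
def alignment_scores(a, b, match=1, mismatch=-2, gap=-1):
    """Return (global_score, local_score) for sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # glob[i][j]: best score aligning all of a[:i] with all of b[:j]
    glob = [[0] * cols for _ in range(rows)]
    # loc[i][j]: best score of a local alignment ending at a[i-1], b[j-1]
    loc = [[0] * cols for _ in range(rows)]

    # A global alignment must pay for leading gaps; a local one need not.
    for i in range(1, rows):
        glob[i][0] = i * gap
    for j in range(1, cols):
        glob[0][j] = j * gap

    best_local = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            glob[i][j] = max(glob[i - 1][j - 1] + s,
                             glob[i - 1][j] + gap,
                             glob[i][j - 1] + gap)
            # The extra 0 lets a local alignment restart anywhere.
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + s,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])
    return glob[-1][-1], best_local


g, l = alignment_scores("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ")
print("global:", g, "local:", l)
```

The local score can never be lower than the global one, because a local alignment is free to drop the poorly matching flanks that a global alignment is forced to include.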

6 Sequence evolution
In the course of evolution, the sequences changed from the ancestral sequence by random mutations.
Three types of changes:
1. Insertion: an insertion of a letter or several letters into the sequence. AAGA → AAGTA

7 Sequence evolution
In the course of evolution, the sequences changed from the ancestral sequence by random mutations.
Three types of changes:
1. Insertion: an insertion of a letter or several letters into the sequence. AAGA → AAGTA
2. Deletion: a deletion of a letter (or more) from the sequence. AAGA → AGA

8 Evolutionary changes in sequences
In the course of evolution, the sequences changed from the ancestral sequence by random mutations.
Three types of mutations:
1. Insertion: an insertion of a letter or several letters into the sequence. AAGA → AAGTA
2. Deletion: a deletion of a letter (or more) from the sequence. AAGA → AGA
3. Substitution: a replacement of one (or more) sequence letter by another. AAGA → AACA
Insertion + Deletion → Indel
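The three mutation types can be written as simple string operations. This is a minimal sketch that reproduces the AAGA examples from the slide; the function names and the zero-based `pos` argument are assumptions made for illustration.

```python
def insert(seq: str, pos: int, letters: str) -> str:
    """Insert one or more letters before index pos."""
    return seq[:pos] + letters + seq[pos:]


def delete(seq: str, pos: int, length: int = 1) -> str:
    """Delete `length` letters starting at index pos."""
    return seq[:pos] + seq[pos + length:]


def substitute(seq: str, pos: int, letter: str) -> str:
    """Replace the letter at index pos with another letter."""
    return seq[:pos] + letter + seq[pos + 1:]


ancestor = "AAGA"
print(insert(ancestor, 3, "T"))      # AAGTA
print(delete(ancestor, 0))           # AGA
print(substitute(ancestor, 2, "C"))  # AACA
```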

9 Sequence alignment
Two sequences:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
One possible alignment:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
This alignment includes: 10 perfect matches, 2 mismatches, 4 indels (gaps)
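The counts quoted for this alignment can be checked column by column. The sketch below assumes gaps are written as "-" and that both rows have equal length; `count_columns` is a hypothetical helper name.

```python
def count_columns(row_a: str, row_b: str) -> dict:
    """Count matches, mismatches and indels in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "indel": 0}
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            counts["indel"] += 1      # a gap in either row
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


counts = count_columns("AAGCTGAATT-C-GAA",
                       "AGGCT-CATTTCTGA-")
print(counts)  # {'match': 10, 'mismatch': 2, 'indel': 4}
```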

10 Choosing an alignment:
• Many different alignments are possible for the same two sequences:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
Alignment 1:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Alignment 2:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Which alignment is better?

11 Scoring an alignment: example of a naïve scoring system
• Match: +1
• Mismatch: -2
• Indel: -1
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)×10 + (-2)×2 + (-1)×4 = 2
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)×9 + (-2)×2 + (-1)×6 = -1
Higher score → better alignment
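The naïve scoring system is easy to turn into a function and apply to both candidate alignments. This is a sketch using the slide's values (+1, -2, -1) as defaults; the function name is hypothetical.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-2, indel=-1):
    """Sum per-column scores of a gapped pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += indel
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first, second)  # 2 -1
```

Changing the default penalties changes which alignment wins, which is why the choice of scoring system matters.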

12 Scoring system:
• Different scoring systems can produce different optimal alignments
• Scoring systems implicitly represent a particular theory of similarity/dissimilarity between sequence characters: evolution based or physico-chemical properties based
Some mismatches are more plausible than others:
• Transition vs. transversion
• Lys → Arg ≠ Lys → Cys
• Gap extension vs. gap opening

13 Substitution Matrices
• Nucleic acids: transition-transversion
• Amino acids:
  Evolution (empirical data) based: PAM, BLOSUM
  Physico-chemical properties based: Grantham, McLachlan
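For nucleic acids, a transition-transversion matrix can be sketched in a few lines. A transition swaps two purines (A, G) or two pyrimidines (C, T); a transversion swaps a purine for a pyrimidine. The score values below are hypothetical, chosen only so that transitions are penalized less than transversions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x: str, y: str) -> str:
    """Classify a nucleotide pair as identity, transition or transversion."""
    if x == y:
        return "identity"
    same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


def nucleotide_matrix(match=1, transition=-1, transversion=-2):
    """Build a 4x4 substitution matrix as a dict keyed by (x, y)."""
    scores = {"identity": match,
              "transition": transition,
              "transversion": transversion}
    return {(x, y): scores[substitution_type(x, y)]
            for x in "ACGT" for y in "ACGT"}


matrix = nucleotide_matrix()
print(matrix[("A", "G")], matrix[("A", "C")])  # -1 -2
```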

14 PAM Matrices
• A family of matrices: PAM80, PAM120, PAM250
• The number attached to a PAM matrix represents evolutionary distance
• Greater numbers denote greater distances

15 Which PAM matrix to use?
• Low PAM numbers: strong similarities
• High PAM numbers: weak similarities
  PAM120 for general use (40% identity)
  PAM60 for close relations (60% identity)
  PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices: PAM40, PAM120, PAM250

16 PAM - limitations
• Based on only one original dataset
• Examines proteins with few differences (85% identity)
• Based mainly on small globular proteins, so the matrix is biased

17 BLOSUM Matrices
• Different BLOSUMn matrices are calculated independently from BLOCKS
• BLOSUMn is based on sequences that share at least n percent identity
• BLOSUM62 represents closer sequences than BLOSUM45

18 Example: BLOSUM62 is derived from blocks of sequences that share at least 62% identity

19 Which BLOSUM matrix to use?
• Low BLOSUM numbers for distant sequences
• High BLOSUM numbers for similar sequences
  BLOSUM62 for general use
  BLOSUM80 for close relations
  BLOSUM45 for distant relations

20 PAM vs. BLOSUM (listed from closer to more distant sequences)
PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45

21 Gap penalty
• We expect to penalize gaps
• A different score for gap opening and for extension
  Insertions and deletions are rare in evolution
  But once they occur, they are easy to extend
  Gap-extension penalty < gap-opening penalty
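A separate opening and extension score is usually called an affine gap penalty. The sketch below uses one common convention, assumed here: the first gap position costs the opening penalty and every further position costs the extension penalty. The values -10 and -1 are hypothetical.

```python
import re


def affine_gap_penalty(length, gap_open=-10, gap_extend=-1):
    """Penalty for a single gap of the given length."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def total_gap_penalty(aligned_row, gap_open=-10, gap_extend=-1):
    """Sum the affine penalties of every run of '-' in one aligned row."""
    return sum(affine_gap_penalty(len(run), gap_open, gap_extend)
               for run in re.findall(r"-+", aligned_row))


one_long_gap = affine_gap_penalty(4)          # one indel event of length 4
four_short_gaps = 4 * affine_gap_penalty(1)   # four separate indel events
print(one_long_gap, four_short_gaps)          # -13 -40
```

With these values a single four-letter gap is far cheaper than four one-letter gaps, which reflects the idea that indels are rare but easy to extend.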

22 Web servers for pairwise alignment

23 BLAST 2 sequences (bl2seq) at NCBI
Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine.
• BLAST does not use an exact algorithm but a heuristic

24 Back to NCBI

25 BLAST – bl2seq

26 Bl2Seq - query: blastn – nucleotide; blastp – protein

27 Bl2seq results

28 [Figure labels on the alignment output] Match; Dissimilarity; Gaps; Similarity; Low complexity

29 Bl2seq results:
• Bit score: a score for the alignment according to the number of similarities, identities, etc.
• Expect value (E-value): the number of alignments with a score at least this good that one can "expect" to see by chance when searching a database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is real.
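The relation between bit score and E-value can be sketched with the standard Karlin-Altschul formulas, which the slides do not spell out: the bit score is S' = (λS − ln K) / ln 2 and the expected number of chance hits is E = m · n · 2^(−S'). Here λ and K are statistical parameters of the scoring system, and m and n are the query and database lengths; real BLAST uses corrected "effective" lengths, which this sketch ignores. All numeric values in the example are hypothetical.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_hits(bits, query_len, db_len):
    """E = m * n * 2^(-S') for a search space of m * n positions."""
    return query_len * db_len * 2.0 ** (-bits)


# Hypothetical search: a 150-residue query against 10 million residues.
for bits in (20, 40, 60):
    print(bits, expected_hits(bits, 150, 10_000_000))
```

Each additional bit halves the E-value, and a larger database raises it, which is why the same alignment can be significant in a small search and not in a large one.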

30 BLAST – programs
Query: DNA or Protein
Database: DNA or Protein

31 BLAST – Blastp

32 Blastp - results

33 Blastp – results (cont’)

34 Blastp – acquiring sequences

35 blastp – acquiring sequences (cont’)

36 FASTA format – multiple sequences
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
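A multi-sequence FASTA file is simple to read: each record starts with a ">" header line, and the following lines up to the next header are joined into one sequence. The parser below is a minimal sketch using only the standard library; `parse_fasta` is a hypothetical helper, and the sample text reuses the first two records shown on the slide.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = """\
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
"""

records = parse_fasta(sample)
for header, sequence in records:
    print(header.split("|")[3], len(sequence))
```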

37 Searching for remote homologs
• Sometimes BLAST isn't enough
• Large protein family, and BLAST only finds close members. We want more distant members
• PSI-BLAST
• Profile HMMs (not discussed in this exercise)

38 PSI-BLAST
• Position Specific Iterated BLAST
Regular BLAST → construct profile from BLAST results → BLAST profile search → final results
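The "construct profile" step can be illustrated with a toy example. The sketch below builds per-column residue frequencies from a set of already aligned hits and scores candidates against them. It is a deliberate simplification: real PSI-BLAST builds a log-odds position-specific scoring matrix with pseudocounts and sequence weighting, none of which is modelled here. The function names and the five-letter example hits are hypothetical.

```python
from collections import Counter


def build_profile(aligned_hits):
    """Column-wise residue frequencies from equal-length aligned sequences."""
    length = len(aligned_hits[0])
    if any(len(seq) != length for seq in aligned_hits):
        raise ValueError("all hits must be aligned to the same length")
    profile = []
    for column in zip(*aligned_hits):
        counts = Counter(column)
        profile.append({res: n / len(aligned_hits)
                        for res, n in counts.items()})
    return profile


def profile_score(profile, candidate):
    """Sum, over columns, of how often the candidate's residue was seen."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, candidate))


hits = ["MVHLT", "MVHFT", "MGHFT"]   # toy hits from a first BLAST round
profile = build_profile(hits)

for candidate in ("MVHFT", "MGHLT", "AAAAA"):
    print(candidate, round(profile_score(profile, candidate), 2))
```

Because the profile records which residues are tolerated at each position, a candidate can score well even if it is not especially close to any single hit, which is how the search reaches more distant family members.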

39 PSI-BLAST
• Advantage: PSI-BLAST looks for sequences that are close to the query, and learns from them to extend the circle of friends
• Disadvantage: if we obtain a WRONG hit, the search is led to unrelated sequences (contamination). This gets worse with each iteration

40 BLAST – PSI-Blast

41 PSI-Blast - results

