Alignment methods April 12, 2005 Return Homework (Ave. = 7.5)

Presentation transcript:

Alignment methods April 12, 2005 Return Homework (Ave. = 7.5) Reminder: Quiz on Thurs. April 14 Learning objectives- Understand difference between identity, similarity and homology. PAM scoring matrices. Understand difference between global alignment and local alignment. Review of Dotter software program. Workshop-Import sequences of interest from GenBank, place in FASTA format, align sequences using DOTTER program. Homework #4 due on Tues, April 19 at the beginning of class.

Purpose of finding differences and similarities of amino acids in two proteins: 1. Infer structural information. 2. Infer functional information. 3. Infer evolutionary relationships.

Evolutionary Basis of Sequence Alignment 1. Similarity: Quantity that relates how much two amino acid sequences are alike. 2. Identity: Quantity that describes how much two sequences are alike in the strictest terms. 3. Homology: A conclusion drawn from data suggesting that two genes share a common evolutionary history.
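Identity is the easiest of the three terms to compute. The sketch below counts identical residues in two already-aligned sequences. The function name `percent_identity`, the use of `-` as the gap character, and the choice to leave gap columns out of the denominator are assumptions made for illustration; other conventions for the denominator exist.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of compared columns that hold the same residue.

    Assumes both strings come from the same alignment (equal length)
    and that '-' marks a gap. Gap columns are skipped (an assumption).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gap column: not counted either way
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Made-up toy alignment: 5 compared columns, 4 identical.
print(percent_identity("ACIL-A", "ACLLCA"))
```

Similarity, unlike identity, needs a scoring matrix to say how alike two different residues are, and homology is a conclusion rather than a number.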

Evolutionary Basis of Sequence Alignment (Cont. 1) Why are there regions of identity? 1) Conserved function-residues participate in reaction. 2) Structural (For example, conserved cysteine residues that form a disulfide linkage) 3) Historical-Residues that are conserved solely due to a common ancestor gene.

One is mouse trypsin and the other is crayfish trypsin. They are homologous proteins. The sequences share 41% identity.

Evolutionary Basis of Sequence Alignment (Cont. 2) Note: it is possible that two proteins share a high degree of similarity but have two different functions. For example, human gamma-crystallin is a lens protein that has no known enzymatic activity. It shares a high percentage of identity with E. coli quinone oxidoreductase. These proteins likely had a common ancestor but their functions diverged. Analogous to railroad car and diner function.

Modular nature of proteins The previous alignment was global. However, many proteins do not display global patterns of similarity. Instead, they possess local regions of similarity. Proteins can be thought of as assemblies of modular domains. THINK OF MR. POTATOHEAD. It is thought that this may, in some cases, be due to a process known as exon shuffling.

Modular nature of proteins (cont. 1) [Figure: exon shuffling. Gene A (Exon 1a, Exon 2a) and Gene B (Exon 1b, Exon 2b). Panels: duplication of Exon 2a; exchange with Gene B. After the exchange, the genes carry an Exon 3 labeled "Exon 2b from Gene B" and "Exon 2a from Gene A".]

Scoring Matrices Importance of scoring matrices Scoring matrices appear in all analyses involving sequence comparisons. The choice of matrix can strongly influence the outcome of the analysis. Scoring matrices implicitly represent a particular theory of relationships. Understanding theories underlying a given scoring matrix can aid in making proper choice.

Identity Matrix: simplest type of scoring matrix (1 for identical amino acids, 0 otherwise)

     A  C  I  L
  A  1
  C  0  1
  I  0  0  1
  L  0  0  0  1

Similarity It is easy to score if an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar. +NH3 CO2- +NH3 CO2- Isoleucine Leucine Should they get a 0 (non-identical) or a 1 (identical) or Something in between?

Scoring Matrices When we consider scoring matrices, we encounter the convention that matrices have numeric indices corresponding to the rows and columns of the matrix. For example, M11 refers to the entry at the first row and the first column. In general, Mij refers to the entry at the ith row and the jth column. To use this for sequence alignment, we simply associate a numeric value to each letter in the alphabet of the sequence.
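The letter-to-index convention can be sketched in a few lines. This example builds the identity matrix for the four-letter alphabet A, C, I, L and looks up scores through a letter-to-index table. The names `ALPHABET`, `INDEX`, `score` and `alignment_score` are illustrative. Python indices start at 0, so the slide's M11 corresponds to `IDENTITY[0][0]` here.

```python
ALPHABET = "ACIL"  # toy alphabet; a real protein matrix has 20 letters

# Associate a numeric index with each letter of the alphabet.
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

# Identity matrix: 1 on the diagonal, 0 everywhere else.
IDENTITY = [
    [1 if i == j else 0 for j in range(len(ALPHABET))]
    for i in range(len(ALPHABET))
]


def score(a: str, b: str, matrix=IDENTITY) -> int:
    """Look up the score for aligning residue a with residue b."""
    return matrix[INDEX[a]][INDEX[b]]


def alignment_score(seq_a: str, seq_b: str, matrix=IDENTITY) -> int:
    """Sum the per-column scores of two equal-length, ungapped sequences."""
    return sum(score(a, b, matrix) for a, b in zip(seq_a, seq_b))


print(score("I", "I"), score("I", "L"))
print(alignment_score("ACIL", "ACLL"))
```

Swapping `IDENTITY` for a PAM or BLOSUM table changes the scores but not the lookup, which is why the index convention is worth fixing first.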

Two major scoring matrices for amino acid sequence comparisons PAM-derived from sequences known to be closely related (Eg. Proteins from chimpanzees and human). PAM1 was created from empirical data and other PAMs were mathematically derived. BLOSUM-derived from sequences not closely related (Eg. E. coli and human) from data stored in the BLOCKS database.

The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix Started by Margaret Dayhoff, 1978 A series of matrices describing the extent to which two amino acids have been interchanged in evolution. Proteins were aligned by eye and then the number of times an amino acid was substituted in different species was counted.

Protein families used to construct Dayhoff’s scoring matrix

  Protein               PAMs per 100 million years
  IgG kappa C region    37
  Kappa casein          33
  Serum albumin         26
  Cytochrome C          0.9
  Histone H3            0.14
  Histone H4            0.10

Numbers of accepted point mutations, multiplied by 10 (axes: original amino acid, replacement amino acid)

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A
R  30
N 109  17
D 154   0 532
C  33  10   0   0
Q  93 120  50  76   0
E 266   0  94 831   0 422
G 579  10 156 162  10  30 112
H  21 103 226  43  10 243  23  10
I  66  30  36  13  17   8  35   0   3
L  95  17  37   0   0  75  15  17  40 253
K  57 477 322  85   0 147 104  60  23  43  39
M  29  17   0   0   0  20   7   7   0  57 207  90
F  20   7   7   0   0   0   0  17  20  90 167   0  17
P 345  67  27  10  10  93  40  49  50   7  43  43   4   7
S 772 137 432  98 117  47  86 450  26  20  32 168  20  40 269
T 590  20 169  57  10  37  31  50  14 129  52 200  28  10  73 696
W   0  27   3   0   0   0   0   0   3   0  13   0   0  10   0  17   0
Y  20   3  36   0  30   0  10   0  40  13  23  10   0 260   0  22  23   6
V 365  20  13  17  33  27  37  97  30 661 303  17  77  10  50  43 186   0  17

Calculation of relative mutability of amino acid Find frequency of amino acid change to another amino acid at a certain position in protein. Divide the frequency of aa change by the frequency that the “j” (original) aa occurs in all proteins studied. This is called the “mutability”. Determine the factor to multiply the alanine mutability to get 100. Multiply the 19 other a.a. mutabilities by the same factor. This is called the relative mutability
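The steps above can be sketched as a short calculation. The counts in the example are made up for illustration and are not Dayhoff's data. The function name `relative_mutabilities` and its parameter names are likewise assumptions.

```python
def relative_mutabilities(changes, occurrences, reference="Ala"):
    """Scale per-residue mutabilities so the reference residue equals 100.

    changes[aa]     -- how often aa was observed to change (illustrative)
    occurrences[aa] -- how often aa occurs in the proteins studied
    """
    # Step 1-2: mutability = frequency of change / frequency of occurrence.
    mutability = {aa: changes[aa] / occurrences[aa] for aa in changes}
    # Step 3: factor that brings the reference (alanine) to 100.
    factor = 100.0 / mutability[reference]
    # Step 4: apply the same factor to every amino acid.
    return {aa: m * factor for aa, m in mutability.items()}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
toy_changes = {"Ala": 30, "Gly": 12, "Trp": 5}
toy_occurrences = {"Ala": 300, "Gly": 400, "Trp": 100}

for aa, value in relative_mutabilities(toy_changes, toy_occurrences).items():
    print(aa, round(value, 1))
```

With these toy counts, alanine scales to 100 by construction and the other residues are expressed relative to it.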

Relative mutabilities of amino acids

  Asn 134   Ser 120   Asp 106   Glu 102   Ala 100
  Thr  97   Ile  96   Met  94   Gln  93   Val  74
  His  66   Arg  65   Lys  56   Pro  56   Gly  49
  Tyr  41   Phe  41   Leu  40   Cys  20   Trp  18

Why are the mutabilities different? Amino acids with high mutabilities can be replaced by a similar amino acid (for example, Asp for Glu). Conversely, amino acids with low mutabilities are unique and cannot easily be replaced.

Creation of a mutation probability matrix Used accepted mutation data from earlier slide and the mutability of each amino acid in nature to create a mutation probability matrix. Mij shows the probability that an original amino acid j (in columns) will be replaced by amino acid i (in rows) over a defined evolutionary interval. For PAM1, 1% of aa’s have been changed.

PAM1 mutational probability matrix Values of each column will sum to 10,000

The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix A 1-PAM unit is equivalent to 1 mutation found in a stretch of 2 aligned sequences, each containing 100 amino acids.

Example 1:

  ..CNGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV..
    |||||||||||||| |||||||||||||||||||||||||||||||||||
  ..CNGTTDQVDKIVKIRNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV..

length = 100, 1 mismatch, PAM distance = 1

A k-PAM unit is equivalent to k 1-PAM units (or M^k, the PAM1 matrix M raised to the kth power).
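Taking k 1-PAM steps amounts to raising the PAM1 mutation probability matrix to the k-th power. The sketch below shows this with a made-up two-letter alphabet instead of the real 20 x 20 matrix. `TOY_PAM1`, `mat_mul` and `mat_pow` are hypothetical names, and the toy columns sum to 1 rather than to the scaled 10,000 used on the slides.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, k):
    """Raise a square matrix to the k-th power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": entry [i][j] is the probability that original residue j
# (column) is replaced by residue i (row). Values are invented.
TOY_PAM1 = [
    [0.99, 0.02],
    [0.01, 0.98],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(x, 4) for x in row])
```

Each column still sums to 1 after any number of steps, and the diagonal entries shrink as k grows, which mirrors how identity drops at larger PAM distances.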

The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix

  Observed % difference    Evolutionary distance in PAMs
   1                         1
   5                         5
  10                        11
  20                        23
  40                        56
  50                        80
  60                       112
  70                       159
  80                       246
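The table of observed difference against PAM distance can be used as a lookup. The sketch below stores the tabulated pairs and interpolates linearly between them. Linear interpolation is an assumption made here for illustration only, since the underlying relationship is curved, and `pam_from_observed` is a hypothetical helper name.

```python
# Tabulated pairs: observed % difference -> evolutionary distance in PAMs.
OBSERVED = [1, 5, 10, 20, 40, 50, 60, 70, 80]
PAMS = [1, 5, 11, 23, 56, 80, 112, 159, 246]


def pam_from_observed(pct_difference: float) -> float:
    """Estimate PAM distance from observed % difference.

    Uses straight-line interpolation between neighbouring table rows
    (an approximation); values outside the table are rejected.
    """
    if not OBSERVED[0] <= pct_difference <= OBSERVED[-1]:
        raise ValueError("observed difference is outside the tabulated range")
    for i in range(len(OBSERVED) - 1):
        x0, x1 = OBSERVED[i], OBSERVED[i + 1]
        if x0 <= pct_difference <= x1:
            y0, y1 = PAMS[i], PAMS[i + 1]
            return y0 + (y1 - y0) * (pct_difference - x0) / (x1 - x0)
    raise AssertionError("unreachable for in-range input")


print(pam_from_observed(40))  # a tabulated point
print(pam_from_observed(45))  # between two tabulated points
```

The PAM distance grows faster than the observed difference because repeated changes at the same position are hidden in the observed count.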

Final Scoring Matrix is the Log-Odds Scoring Matrix S(a,b) = 10 log10(Mab/Pb), where a is the replacement amino acid, b is the original amino acid, Mab is the mutational probability matrix number, and Pb is the frequency of amino acid b.
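The log-odds formula S(a,b) = 10 log10(Mab/Pb) can be checked numerically. The probabilities below are made up for illustration, and `log_odds` is a hypothetical helper name. Published matrices usually report scores as rounded integers; the sketch leaves them unrounded.

```python
import math


def log_odds(m_ab: float, p_b: float) -> float:
    """Score from a mutation probability m_ab and a background frequency p_b."""
    return 10 * math.log10(m_ab / p_b)


# Invented values showing the three possible signs of the score.
print(round(log_odds(0.10, 0.05), 2))  # more likely than chance: positive
print(round(log_odds(0.05, 0.05), 2))  # as likely as chance: zero
print(round(log_odds(0.01, 0.05), 2))  # less likely than chance: negative
```

A positive score marks a substitution seen more often than chance alignment would predict, and a negative score marks one seen less often.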

Summary of PAM Scoring Matrix PAM = a unit of evolution (1 PAM = 1 point mutation/100 amino acids). Accepted mutation means fixed point mutation. The matrix is based on a comparison of 71 groups of closely related proteins (>85% identity), yielding 1,572 changes. Different PAM matrices are derived from the PAM 1 matrix by matrix multiplication. The matrices are converted to log odds scoring matrices (frequency of change divided by probability of chance alignment, converted to log base 10). A PAM 250 matrix is roughly equivalent to 20% identity in two sequences.

The Dotter Program The program consists of three components: 1. A sliding window. 2. A table that gives a score for each amino acid match. 3. A graph that converts the score to a dot of a certain density. The higher the density, the higher the score.
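The three components can be sketched in a few lines. This is not Dotter's own code: it is a minimal dot plot that scores each window with an identity table and prints `*` where the window score reaches a threshold, instead of drawing grey levels. The names `dot_plot`, `render`, `window` and `threshold` are all illustrative.

```python
def identity_score(x: str, y: str) -> int:
    """Stand-in scoring table: 1 for identical residues, 0 otherwise."""
    return 1 if x == y else 0


def dot_plot(seq_a: str, seq_b: str, window: int = 3, score=identity_score):
    """Score every pair of windows; rows follow seq_a, columns follow seq_b."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = []
        for j in range(len(seq_b) - window + 1):
            # Slide the window: sum the scores of its aligned positions.
            row.append(sum(score(seq_a[i + k], seq_b[j + k]) for k in range(window)))
        rows.append(row)
    return rows


def render(scores, threshold: int):
    """Turn window scores into text: '*' at or above threshold, '.' below."""
    return ["".join("*" if s >= threshold else "." for s in row) for row in scores]


for line in render(dot_plot("ACILAC", "ACILAC", window=2), threshold=2):
    print(line)
```

Regions of similarity show up as diagonal runs of marks. Replacing `identity_score` with a PAM or BLOSUM lookup lets similar, non-identical residues contribute to the window score as well.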

Two proteins that are similar in certain regions: tissue plasminogen activator (PLAT) and coagulation factor 12 (F12).

A single region on F12 is similar to two regions on PLAT. [Figure label: region of similarity]

FASTA format

>gi|1244762|gb|AAA98563.1| p53 tumor suppressor homolog
MSQGTSPNSQETFNLLWDSLEQVTANEYTQIHERGVGYEYHEAEPDQTSLEISAYRIAQPDPYGRSESYD
LLNPIINQIPAPMPIADTQNNPLVNHCPYEDMPVSSTPYSPHDHVQSPQPSVPSNIKYPGEYVFEMSFAQ
PSKETKSTTWTYSEKLDKLYVRMATTCPVRFKTARPPPSGCQIRAMPIYMKPEHVQEVVKRCPNHATAKE
HNEKHPAPLHIVRCEHKLAKYHEDKYSGRQSVLIPHEMPQAGSEWVVNLYQFMCLGSCVGGPNRRPIQLV
FTLEKDNQVLGRRAVEVRICACPGRDRKADEKASLVSKPPSPKKNGFPQRSLVLTNDITKITPKKRKIDD
ECFTLKVRGRENYEILCKLRDIMELAARIPEAERLLYKQERQAPIGRLTSLPSSSSNGSQDGSRSSTAFS
TSDSSQVNSSQNNTQMVNGQVPHEEETPVTKCEPTENTIAQWLTKLGLQAYIDNFQQKGLHNMFQLDEFT
LEDLQSMRIGTGHRNKIWKSLLDYRRLLSSGTESQALQHAASNASTLSVGSQNSYCPGFYEVTRYTYKHT
ISYL
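A FASTA record is a header line starting with `>` followed by one or more sequence lines. The sketch below reads such text into (header, sequence) pairs. `parse_fasta` is a hypothetical helper written for this note, not part of Dotter or GenBank, and the toy records in the example are made up.

```python
def parse_fasta(text: str):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without breaks
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


toy = """>seq1 made-up example
ACIL
LICA
>seq2 another made-up example
ACLL
"""
for name, seq in parse_fasta(toy):
    print(name, len(seq), seq)
```

Joining the wrapped lines matters for the workshop: the sequence is one continuous string, and the line breaks in the file are only for display.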

Workshop 3