Sequence evolution and homology identification

Xuhua Xia

Nucleotide substitution
- Usually too slow to monitor directly (compare spontaneous mutation rates).
- For mammalian nuclear DNA (regions not under functional constraint): ~ 4 x … nt substitutions per site per year.
- Much higher for viruses, e.g., … to … nt substitutions per site per generation.
- So use comparative analysis of 2 sequences that share a common ancestor: determine the number and nature of the nt substitutions that have occurred (i.e., measure the degree of divergence).

Potential pitfalls
1. Are all evolutionary changes being monitored?
- If the sequences are closely related, there is a high probability of only one change at any given site, but if they are distant, there may have been multiple substitutions ("hits") at a site.
- Algorithms can be used to correct for this.
2. If there are indels between two sequences, can they be aligned with confidence?
- Algorithms with gap penalties are used for this.

13 substitutions occurred but only 3 discovered by sequence comparison
[Figure 3.6: an ancestral sequence and the present-day sequences derived from it]

Multiple hits
Homoplasy: the same nt at a site, but not directly inherited from the ancestral sequence. (When comparing long stretches, it is highly unlikely that the sequences would have converged to the same sequence.)
Source: Page & Holmes, Fig. 5.9
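The pitfalls slide notes that algorithms can correct for multiple hits. The slides do not name one, so the sketch below uses the Jukes-Cantor (JC69) correction as an assumed, commonly taught example; the function names and the toy sequences are illustrative only.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned, gap-free sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)

def jc69_distance(p):
    """Estimated substitutions per site under JC69 (assumed model, not from the slides)."""
    if p >= 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy sequences (hypothetical): 2 observed differences over 10 sites.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jc69_distance(p)
print(f"observed p = {p:.3f}, JC69-corrected d = {d:.3f}")
```

The corrected distance is always at least as large as the observed proportion of differences, which reflects the hidden substitutions that a direct comparison misses.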

Uncertainty in codon substitution
Pathway I: CCC (Pro) → CCA (Pro) → CAA (Gln)
Pathway II: CCC (Pro) → CAC (His) → CAA (Gln)
Is one pathway more likely than another? (p. 82)
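The two pathways can be enumerated mechanically. The sketch below is only an illustration: it assumes the standard genetic code (written with T in place of U), changes one differing codon position at a time, and prints each intermediate codon with its one-letter amino acid. The helper name `pathways` is hypothetical.

```python
from itertools import permutations

# Standard genetic code in TCAG order (assumption: nuclear code, DNA alphabet).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def pathways(start, end):
    """All single-step routes from start to end, one differing position per step."""
    diff = [i for i in range(3) if start[i] != end[i]]
    routes = []
    for order in permutations(diff):
        codon = start
        path = [codon]
        for pos in order:
            codon = codon[:pos] + end[pos] + codon[pos + 1:]
            path.append(codon)
        routes.append(path)
    return routes

for path in pathways("CCC", "CAA"):
    print(" -> ".join(f"{c}({CODON_TABLE[c]})" for c in path))
```

The enumeration only lists the routes. Pathway I contains one synonymous and one nonsynonymous step, whereas Pathway II contains two nonsynonymous steps; deciding which is more likely needs a substitution model, which this sketch does not attempt.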

The purpose of sequence alignment
Identification of sequence homology and homologous sites.
Homology: similarity that is the result of inheritance from a common ancestor (identification and analysis of homologies is central to phylogenetics).
An alignment is a hypothesis of positional homology between bases or amino acids.

Normal and Thalassemia HBb
          ----|----|----|----|----|----|----|----|----|----|----|----|--
Normal    AUGGUGCACCUGACUCCUGAGGAGAAGUCUGCCGUUACUGCCCUGUGGGGCAAGGUGAACGU
Thalass.  AUGGUGCACCUGACUCCUGAGGAGAAGUCUGCCGUUACUGCCCUGUGGGGCAAGGUGAACGU
          **************************************************************

          --|----|----|----|----|----|----|----|----|----|----|----|----
Normal    GGAUGAAGUUGGUGGU-GAGGCCCUGGGCAGGUUGGUAUCAAGGUUACAAGACAGG......
Thalass.  GGAUGAAGUUGGUGGUUGAGGCCCUGGGCAGGUUGGUAUCAAGGUUACAAGACAGG......
          **************** ***************************************

Are the two genes homologous?
What evolutionary change can you infer from the alignment?
What is the consequence of the evolutionary change?

Janeka, JE et al. 2007 Science 318:792

What is an optimal alignment?
Alignment 1:
Favorite
Favourite
Alignment 2:
---Favorite
Favourite--
Alignment 3:
Favorite
Favourite
Alignment 4:
Favo-rite
Favourite
An optimal alignment: one with the maximum number of matches and the minimum number of mismatches and gaps.
Operational definition: the alignment with the highest alignment score given a particular scoring scheme (e.g., match: 2, mismatch: -1, gap: -2).
Which of the 4 alignments above is the optimal alignment?
Changing the scoring scheme may change the optimal alignment.

Importance of scoring schemes
Two alternative alignments:
Alignment 1:
ACCCAGGGCTTA
ACCCGGGCTTAG
Alignment 2:
ACCCAGGGCTTA
ACCC-GGGCTTAG
Scoring scheme 1: match: 2, mismatch: 0, gap: -5
Scoring scheme 2: match: 2, mismatch: -1, gap: -1
Which of the two is the optimal alignment according to scoring scheme 1? Which according to scoring scheme 2?
Importance of biological input.
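The question on this slide can be checked by scoring both alignments column by column. The sketch below is a minimal illustration; it assumes that the first sequence in Alignment 2 is padded with a trailing gap so that both rows have equal length, and that this end gap is penalized like any other gap. The function and dictionary names are hypothetical.

```python
def alignment_score(top, bottom, match, mismatch, gap):
    """Sum column scores of a pairwise alignment; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

ALIGNMENTS = {
    "Alignment 1": ("ACCCAGGGCTTA", "ACCCGGGCTTAG"),
    # Trailing gap in the first row is an assumption made for this sketch.
    "Alignment 2": ("ACCCAGGGCTTA-", "ACCC-GGGCTTAG"),
}
SCHEMES = {
    "Scheme 1": {"match": 2, "mismatch": 0, "gap": -5},
    "Scheme 2": {"match": 2, "mismatch": -1, "gap": -1},
}

for scheme_name, scheme in SCHEMES.items():
    for aln_name, (top, bottom) in ALIGNMENTS.items():
        print(scheme_name, aln_name, alignment_score(top, bottom, **scheme))
```

Under these assumptions Alignment 1 scores higher with scheme 1 (14 versus 12) and Alignment 2 scores higher with scheme 2 (20 versus 9), which is the point of the slide: the optimal alignment depends on the scoring scheme.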

Dynamic Programming
Constant gap penalty. Scoring scheme:
Match (M): 2
Mismatch (MM): …
Gap (G): -2
For each cell, compute three values and keep the largest:
- Upleft value + IF(Match, M, MM)
- Left value + G
- Up value + G
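The cell rule above can be sketched as a global alignment score computation. This is a minimal illustration, not the author's implementation: the mismatch score is missing from the slide, so the default of -1 is an assumption borrowed from the example scheme on the optimal-alignment slide, and the function name is hypothetical.

```python
def global_alignment_score(s, t, match=2, mismatch=-1, gap=-2):
    """Fill the dynamic programming matrix and return (best score, matrix)."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: only gaps are possible.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            upleft = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            left = score[i][j - 1] + gap
            up = score[i - 1][j] + gap
            score[i][j] = max(upleft, left, up)
    return score[-1][-1], score

best, matrix = global_alignment_score("Favorite", "Favourite")
print(best)
```

With these assumed scores, the best alignment of "Favorite" and "Favourite" has 8 matches and 1 gap. Recovering the alignment itself would need a traceback through the matrix, which is omitted here for brevity.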

Alignment with secondary structure
Sequences:
Seq1: CACGACCAATCTCGTG
Seq2: CACGGCCAATCCGTG
Stem-loop structures (stem pairing, then loop):
Seq1: CACGA ||||| GUGCU, loop CCAAUC
Deletion
Missing link: CACGA ||||| GUGCU, loop CCAAU
Correlated substitution
Seq2: CACGG ||||| GUGCC, loop CCAAU
Conventional alignment:
Seq1: CACGACCAATCTCGTG
Seq2: CACGGCCAATC-CGTG
Correct alignment:
Seq1: CACGACCAATCTCGTG
Seq2: CACGGCCAAT-CCGTG
Hickson et al., 2000; Kjer, 1995; Notredame et al., 1997

Multiple Alignment: Guide Tree
Unaligned sequences:
ATTCCAAG...
ATTTCCAAG...
ATTCCCAAG...
ATCGGAAG...
ATCCGAAG...
ATCCAAAG...
AATTCCAAG...
AATTTCCAAG...
AATTCCCAAG...
AAGTCCAAG...
AAGTCAAG...
Aligned sequences:
ATT-CC-AAG...
ATTTCC-AAG...
ATTCCC-AAG...
AT--CGGAAG...
AT-CCG-AAG...
AT-CCA-AAG...
AATTCC-AAG...
AATTTCCAAG...
AATTCCCAAG...
AAGTCC-AAG...
AAGTC--AAG...
[Figure: guide tree with tips Seq1 to Seq16]

Aligned Sequences

CLUSTAL
CLUSTAL W (1.81) Multiple Sequence Alignments
Sequence 1: ArabidopsisAAG aa
Sequence 2: ArabidopsisAAC aa
Sequence 3: yeast aa
ArabAAG FIVDEADLLLDLGFRRDVEKIIDCLPRQR QSLLFSATIPKEVRRVS-QLVLKR 539
ArabAAC FIVDEADLLLDLGFKRDVEKIIDCLPRQR QSLLFSATIPKEVRRVS-QLVLKR 586
yeast VLDEADRLLEIGFRDDLETISGILNEKNSKSADNIKTLLFSATLDDKVQKLANNIMNKK 323
::**** **::**: *:*.* . * .: ::******: .:*:::: ::: *:
Symbols used? * : .

Binomial distribution
Toss a fair coin once. What is the probability of having a head (H)?
Toss a fair coin twice. What is the probability of having HH, HT or TT (T for tail)? $(p + q)^2$ with $p = q = 0.5$.
Toss a fair coin 10 times and record the outcome, e.g., HTHHTHHTTT. What is the expected number of HH occurring in the string? $E = np = (10 - 2 + 1) \times 0.5^2 = 2.25$
What is the probability of 0, 1, 2, ..., 9 matches? $(p + q)^n = (0.25 + 0.75)^9$
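The coin example can be worked through numerically. The sketch below follows the slide in treating the 9 possible start positions as binomial trials with p = 0.25; the variable names are illustrative. One caveat worth stating: overlapping positions are not strictly independent, so the binomial probabilities are an approximation, while the expected count of 2.25 holds regardless.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_tosses, word_len = 10, 2
n = n_tosses - word_len + 1   # 9 positions where "HH" can start
p = 0.5 ** word_len           # 0.25, probability of HH at one position
expected = n * p              # 2.25

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(f"E = {expected}")
for k, prob in enumerate(pmf):
    print(f"P({k} matches) = {prob:.4f}")
```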

Basic stats in string matching
Given $P_A$, $P_C$, $P_G$, $P_T$ in a target (database) sequence, the probability of a query sequence, say, ATTGCC, having a perfect match of the target sequence is: $\text{prob} = P_A P_C^2 P_G P_T^2$
Let M be the target sequence length and N be the query sequence length. The "matching operation" can be performed (M - N + 1) times, e.g.,
Query: ATG
Target: CGATTGCCCG
The probability distribution of the number of matches follows a binomial distribution with p = prob and n = (M - N + 1).
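A short sketch of the two quantities defined on this slide: the probability of a perfect match at one position and the number of positions at which the query can be placed. The base frequencies used here are made-up illustrative values, and the function names are hypothetical.

```python
from collections import Counter

def perfect_match_probability(query, freqs):
    """Product of target base frequencies over the letters of the query."""
    prob = 1.0
    for base, count in Counter(query).items():
        prob *= freqs[base] ** count
    return prob

def n_positions(target_len, query_len):
    """Number of times the matching operation can be performed: M - N + 1."""
    return target_len - query_len + 1

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
skewed = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}  # hypothetical AT-rich target

print(perfect_match_probability("ATTGCC", uniform))
print(perfect_match_probability("ATTGCC", skewed))
print(n_positions(len("CGATTGCCCG"), len("ATG")))
```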

Basic stats in string matching
Computation involving a binomial distribution requires taking the factorial of n. When n is large (e.g., > 1000), computing n! directly becomes impractical, so an approximation is useful.
When np > 50, the binomial distribution can be approximated by the normal distribution with mean = np and variance = npq.
When np < 1 and n is very large (which implies that p is very small given np < 1), the binomial distribution can be approximated by the Poisson distribution with mean and variance equal to np (i.e., $\mu = \sigma^2 = np$).
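The Poisson approximation can be checked numerically for a case with np < 1 and large n. The values of n and p below are arbitrary illustrative choices, and the function names are hypothetical; Python's exact integer arithmetic in `math.comb` is used here only to provide a reference value.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Exact binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, mean):
    """Poisson probability of k events when the expected count is mean."""
    return exp(-mean) * mean**k / factorial(k)

n, p = 10_000, 0.00005   # illustrative values with np = 0.5
mean = n * p

for k in range(4):
    b = binomial_pmf(k, n, p)
    q = poisson_pmf(k, mean)
    print(f"k={k}: binomial={b:.6f}  poisson={q:.6f}")
```

For these values the two distributions agree to about four decimal places, which is why the Poisson form is convenient for rare matches in long sequences.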

From Binomial to Poisson

Matching two sequences without gap
Assuming equal nucleotide frequencies, the probability of a nucleotide site in the query sequence matching a site in the target sequence is p = 0.25.
The probability of finding an exact match of L letters is $a = p^L = 0.25^L = 2^{-2L} = 2^{-S}$, where S is called the bit score in BLAST.
M: query length; N: target length.
The query can start at (M - L + 1) positions and the target can start at (N - L + 1) positions. These two values are called the effective lengths of the two sequences and are designated m and n, respectively. They are shorter than M and N, respectively.
The expected number of matches of length L is $mn2^{-S}$, which is called the e-value in ungapped BLAST.
S is calculated differently in gapped BLAST.

Expected number of matches: No gap
Example 1:
T:     GGTTACACGAGTGCTG
       ||||||||||||||||
Q: CTGAGGTTACACGAGTGCTGCTGA
M = 24, L = 16
m = M - L + 1 = 24 - 16 + 1 = 9
$E = m2^{-S} = 9 \times 2^{-2 \times 16} \approx 2.1 \times 10^{-9}$

Example 2:
T: AACCGGTTACACGAGTGCTGAATGC
       ||||||||||||||||
Q: CTGAGGTTACACGAGTGCTGCTGA
M = 24, N = 25, L = 16
m = M - L + 1 = 24 - 16 + 1 = 9
n = N - L + 1 = 25 - 16 + 1 = 10
$E = mn2^{-S} = 9 \times 10 \times 2^{-2 \times 16} \approx 2.1 \times 10^{-8}$
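The two examples can be reproduced with a few lines of arithmetic. The sketch below assumes equal nucleotide frequencies, so that S = 2L as defined on the previous slide; the function names are hypothetical.

```python
def effective_length(seq_len, match_len):
    """Number of start positions for a match of length match_len."""
    return seq_len - match_len + 1

def ungapped_e_value(query_len, target_len, match_len):
    """E = m * n * 2^(-S), with S = 2L under equal nucleotide frequencies."""
    m = effective_length(query_len, match_len)
    n = effective_length(target_len, match_len)
    bit_score = 2 * match_len
    return m * n * 2.0 ** (-bit_score)

# A 24-nt sequence against a 16-nt sequence, 16-nt exact match.
e1 = ungapped_e_value(24, 16, 16)
# A 24-nt sequence against a 25-nt sequence, same 16-nt match.
e2 = ungapped_e_value(24, 25, 16)
print(f"{e1:.3e}  {e2:.3e}")
```

Because E is proportional to the product of the effective lengths, lengthening either sequence raises the expected number of chance matches of the same quality.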

Gapped BLAST
Adapted from Crane & Raymer 2003
Input sequence: AILVPTVIGCTVPT
Algorithm:
1. Break the query sequence into words: AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT, GCTV, CTVP, TVPT
2. Discard common words (i.e., words made entirely of common amino acids) or sequences of low complexity.
3. Search for matches against database sequences, assess significance, and decide whether to discard the match or to continue with extension using dynamic programming:
AILVPTVIGCTVPT
MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC
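The word-splitting and seed-finding steps can be sketched as follows. This is a simplified illustration that uses exact word matches of length 4, as on the slide; it does not filter common or low-complexity words, does not score neighbourhood words, and does not perform the extension step. The function names are hypothetical.

```python
def query_words(query, word_size=4):
    """All overlapping words of the given size, in query order."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]

def find_seeds(query, subject, word_size=4):
    """Exact word hits as (query_position, subject_position) pairs, 0-based."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    return [
        (i, j)
        for i, word in enumerate(query_words(query, word_size))
        for j in index.get(word, [])
    ]

query = "AILVPTVIGCTVPT"
subject = "MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC"
print(query_words(query))
print(find_seeds(query, subject))
```

For this pair, the first five query words hit consecutive subject positions on the same diagonal, which is the kind of seed cluster that an extension step would then try to grow into a longer alignment.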

Blast Output (Nuc. Seq.)
BLASTN 2.2.4 [Aug-26-2002] ...
Query= Seq1 38
Database: MgCDS 480 sequences; 526,317 total letters
Sequences producing significant alignments: Score (bits), E Value
MG…  34.2  7e-004
Score = 34.2 bits (17), Expect = 7e-004
Identities = 35/40 (87%), Gaps = 2/40 (5%)
Query: 1 atgaataacg--attatttccaacgacaaaacaaaaccac 38
         ||||||||||  |||||||||||  |||||| ||||||||
Sbjct: 1 atgaataacgttattatttccaataacaaaataaaaccac 40
Lambda K H
Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
…
effective length of query: 26
effective length of database: 520,557

Matches: 35 x 1 = 35
Mismatches: 3 x (-3) = -9
Gap open: 1 x 5 = 5
Gap extension: 2 x 2 = 4
R = 35 - 9 - 5 - 4 = 17
$S = [\lambda R - \ln(K)] / \ln(2) = [1.37 \times 17 - \ln(0.711)] / \ln(2) \approx 34$
$E = mn2^{-S} = 26 \times 520{,}557 \times 2^{-34} = 7.878 \times 10^{-4}$
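The score and E-value calculation on this slide can be reproduced step by step. The sketch below uses only the numbers shown in the output (λ = 1.37, K = 0.711, effective lengths 26 and 520,557) and rounds the bit score to 34 as the slide does; the function names are hypothetical.

```python
import math

def raw_score(matches, mismatches, gap_opens, gap_extensions,
              reward=1, penalty=-3, gap_open=5, gap_extend=2):
    """Raw score R from the blastn matrix (1, -3) and gap costs (5, 2)."""
    return (matches * reward + mismatches * penalty
            - gap_opens * gap_open - gap_extensions * gap_extend)

def bit_score(raw, lam, k):
    """S = (lambda * R - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)

def e_value(m, n, bits):
    """E = m * n * 2^(-S) for effective lengths m and n."""
    return m * n * 2.0 ** (-bits)

R = raw_score(matches=35, mismatches=3, gap_opens=1, gap_extensions=2)
S = bit_score(R, lam=1.37, k=0.711)
E = e_value(26, 520_557, round(S))
print(R, round(S, 2), f"{E:.3e}")
```

The unrounded bit score from these rounded parameters is slightly below the 34.2 that BLAST reports, presumably because BLAST uses more digits of λ and K than the slide shows.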

Why is "sequence complexity" important when judging whether two sequences are homologous, or whether their similarity is by chance?
[Figure labels: Human DNA; Chimp DNA; Pu-rich region #1; Pu-rich region #2 (which is not homologous to #1); AAGAGGAG; region of unbiased base composition where G = C = A = T]
How frequently is such an 8-nt sequence (AAGAGGAG) expected to occur by chance in a DNA sequence?
- If within the unbiased region: $1/4^8$, i.e., once every 65.5 kb on average.
- If within a Pu-rich region: $1/2^8$, i.e., once every 256 bp on average.
If both sequences are purine rich, then a high % identity (or a small e-value) may not reflect shared evolutionary origin.

E-Value in BLAST
The e-value is the expected number of random matches that are as good as or better than the reported match. It can be a number near zero or much larger than 1.
It is NOT the probability of finding the reported match. Only when the e-value is extremely small can it be interpreted as the probability of finding 1 match that is as good as the reported one (see next slide).
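The relationship between the e-value and a probability can be illustrated numerically. The sketch below assumes that the number of chance matches follows a Poisson distribution with mean E, in line with the Poisson approximation introduced earlier; the function names are hypothetical.

```python
from math import exp

def prob_exactly_one(e_value):
    """Poisson probability of exactly one chance match: E * exp(-E)."""
    return e_value * exp(-e_value)

def prob_at_least_one(e_value):
    """Poisson probability of one or more chance matches: 1 - exp(-E)."""
    return 1.0 - exp(-e_value)

for e in (1e-4, 1e-2, 1.0, 10.0):
    print(f"E={e:g}: P(1)={prob_exactly_one(e):.6g}  P(>=1)={prob_at_least_one(e):.6g}")
```

For very small E both probabilities are close to E itself, whereas for large E they diverge sharply from it, which is why a large e-value cannot be read as a probability.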

E-value and P(1)

Problem with BLAST
Q: GGC GCG CCC AAG CUG UGC
T: GGT GCA CCT AAA CUA UGT
Alternatives: FASTA, AA sequence
