1
Techniques & Applications of Sequence Comparison
Limsoon Wong
Institute for Infocomm Research
29 September 2004
Copyright © 2004 by Limsoon Wong
2
Lecture Plan
Basic sequence comparison methods
  Pairwise alignment, Multiple alignment
Popular tools
  FASTA, BLAST, PatternHunter
Applications
  Homologs, Active sites, Key mutation sites, Looking for SNPs, Determining origin of species, …
More advanced tools
  PHI-BLAST, ISS, PSI-BLAST, SAM, ...
3
Basic Sequence Comparison Methods
A brief refresher
4
Sequence Comparison: Motivations
DNA is the blueprint for living organisms
Evolution is related to changes in DNA
By comparing DNA sequences we can infer evolutionary relationships between the sequences w/o knowledge of the evolutionary events themselves
Foundation for inferring function, active site, and key mutations
5
Alignment
Key aspect of sequence comparison is sequence alignment
A sequence alignment maximizes the number of positions that are in agreement in two sequences
[Figure: Sequence U aligned against Sequence V, with a match, a mismatch, and an indel marked]
6
Alignment: Poor Example
Poor seq alignment shows few matched positions
The two proteins are not likely to be homologous
[Figure: no obvious match between Amicyanin and Ascorbate Oxidase]
7
Alignment: Good Example
Good alignment usually has clusters of extensive matched positions
The two proteins are likely to be homologous
[Figure: good match between Amicyanin and unknown M. loti protein]
8
Alignment: A Formalization
9
Alignment: Simple-minded Probability & Score
Define score S(A) by simple log likelihood as
  S(A) = log(prob(A)) - [m log(s) + h log(s)], with log(p/s) = 1
Then S(A) = #matches - μ #mismatches - δ #indels, where μ and δ are the mismatch and indel penalties
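The score above can be evaluated directly on a finished alignment. The following is a minimal sketch, assuming the alignment is given as two equal-length rows with `-` marking an indel; the function name and the default penalties `mu = delta = 1` are illustrative choices, under which the score reduces to #matches - #mismatches - #indels.

```python
def alignment_score(row_u: str, row_v: str, mu: float = 1.0, delta: float = 1.0) -> float:
    """Score an alignment given as two equal-length rows; '-' marks an indel.

    mu and delta are assumed penalty weights for mismatches and indels.
    """
    if len(row_u) != len(row_v):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = indels = 0
    for a, b in zip(row_u, row_v):
        if a == "-" or b == "-":
            indels += 1          # a letter aligned against a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - delta * indels


# Three matches, one mismatch, one indel.
print(alignment_score("ACG-T", "ACCAT"))
```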
10
Global Pairwise Alignment: Problem Definition
Given sequences U and V of lengths n and m, the number of possible alignments is given by
  f(n, m) = f(n-1, m) + f(n-1, m-1) + f(n, m-1)
  $f(n,n) \sim (1+\sqrt{2})^{2n+1}\, n^{-1/2}$
The problem of finding a global pairwise alignment is to find an alignment A so that S(A) is max among an exponential number of possible alternatives
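To get a feel for how fast the number of alignments grows, the recurrence can be evaluated directly. This is a sketch only: the base case f(n, 0) = f(0, m) = 1 (one sequence is empty, so every letter of the other is an indel) is an assumption, since the slide does not state it.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def num_alignments(n: int, m: int) -> int:
    """Count alignments of sequences of lengths n and m via the slide's recurrence."""
    if n == 0 or m == 0:
        return 1  # assumed base case: only the all-indel alignment remains
    return (num_alignments(n - 1, m)
            + num_alignments(n - 1, m - 1)
            + num_alignments(n, m - 1))


for n in (1, 2, 3, 5, 10):
    print(n, num_alignments(n, n))
```

Even for two sequences of length 10 the count is already in the millions, which is why enumerating alignments is not an option.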
11
Global Pairwise Alignment: Dynamic Programming Solution
Define an indel-similarity matrix s(.,.); e.g.,
  s(x,x) = 1
  s(x,y) = -μ, if x ≠ y
Then [dynamic programming recurrence; equation image not preserved]
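The recurrence itself is not preserved on this slide, so the following is a sketch of the standard dynamic programming scheme for global alignment (Needleman-Wunsch style), not necessarily the exact formulation used in the lecture. The scores `match=1`, `mismatch=-1`, and `indel=-1` are assumed values.

```python
def global_alignment_score(u: str, v: str, match: int = 1,
                           mismatch: int = -1, indel: int = -1) -> int:
    """Best global alignment score of u and v in O(len(u) * len(v)) time."""
    n, m = len(u), len(v)
    # S[i][j] = best score for aligning the prefixes u[:i] and v[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * indel      # u[:i] aligned against nothing
    for j in range(1, m + 1):
        S[0][j] = j * indel      # v[:j] aligned against nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if u[i - 1] == v[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # align u[i-1] with v[j-1]
                          S[i - 1][j] + indel,     # u[i-1] against an indel
                          S[i][j - 1] + indel)     # v[j-1] against an indel
    return S[n][m]


print(global_alignment_score("ACGT", "AGT"))
```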
12
Global Pairwise Alignment: More Realistic Handling of Indels
In Nature, indels of several adjacent letters are not the sum of single indels, but the result of one event
So reformulate as follows:
[Equation image not preserved]
13
Variations of Pairwise Alignment
Fitting a “short” seq to a “long” seq
  Indels at beginning and end are not penalized
Find “local” alignment
  Find i, j, k, l so that S(A) is maximized, where A is an alignment of ui…uj and vk…vl
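For the “local” variant, a common approach is the Smith-Waterman scheme: the same table as in global alignment, except that a cell is never allowed to drop below zero, so an alignment may start and end anywhere. The sketch below uses assumed scores (`match=1`, `mismatch=-1`, `indel=-1`) and returns only the best score, not the indices i, j, k, l.

```python
def local_alignment_score(u: str, v: str, match: int = 1,
                          mismatch: int = -1, indel: int = -1) -> int:
    """Best local alignment score over all substrings of u and v."""
    best = 0
    prev = [0] * (len(v) + 1)
    for i in range(1, len(u) + 1):
        cur = [0] * (len(v) + 1)
        for j in range(1, len(v) + 1):
            sub = match if u[i - 1] == v[j - 1] else mismatch
            # The 0 option restarts the alignment at this cell.
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + indel, cur[j - 1] + indel)
            best = max(best, cur[j])
        prev = cur
    return best


print(local_alignment_score("XXACGTXX", "YYACGTYY"))
```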
14
Multiple Alignment
Multiple seq alignment maximizes number of positions in agreement across several seqs
Seqs belonging to same “family” usually have more conserved positions in a multiple seq alignment
[Figure: multiple alignment with conserved sites marked]
15
Multiple Alignment: A Formalization
16
Multiple Alignment: Naïve Approach
Let S(A) be the score of a multiple alignment A. The optimal multiple alignment A of sequences U1, …, Ur can be extracted from the following dynamic programming computation of S_{m1,…,mr}:
[Equation image not preserved]
This requires O(2^r) steps
Exercise: Propose a practical approximation
17
Phylogeny
By looking at extent of conserved positions in the multiple seq alignment of different groups of seqs, can infer when they last shared an ancestor
Construct “family tree” or phylogeny
18
Popular Tools for Sequence Comparison: FASTA, BLAST, Pattern Hunter
Acknowledgements: Some slides here are “borrowed” from Bin Ma & Dong Xu
19
Scalability of Software
30 billion nt in year 2005
Increasing number of sequenced genomes: yeast, human, rice, mouse, fly, …
S/w must be “linearly” scalable to large datasets
20
Need Heuristics for Sequence Comparison
Time complexity for optimal alignment is O(n^2), where n is sequence length
Given current size of sequence databases, use of optimal algorithms is not practical for database search
Heuristic techniques:
  BLAST
  FASTA
  PatternHunter
  MUMmer, ...
Speed up:
  20 min (optimal alignment)
  2 min (FASTA)
  20 sec (BLAST)
21
Basic Idea: Indexing & Filtering
Good alignment includes short identical, or similar, fragments
Break entire string into substrings, index the substrings
Search for matching short substrings and use as seed for further analysis
Extend to entire string to find the most significant local alignment segment
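The indexing and filtering idea can be sketched with a plain hash index over fixed-length substrings (k-mers). The function names and the choice of exact k-mer matching are illustrative assumptions; real tools differ in how words are chosen and scored.

```python
from collections import defaultdict


def build_index(text: str, k: int) -> dict:
    """Map every length-k substring of text to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index


def find_seeds(query: str, index: dict, k: int) -> list:
    """Return (query_pos, db_pos) pairs where a query k-mer occurs exactly in the db."""
    hits = []
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], ()):
            hits.append((q, d))
    return hits


db = "GATTACAGATTACA"
print(find_seeds("TTAC", build_index(db, 4), 4))
```

Each seed pair is only a candidate; the extension step then decides whether it grows into a significant local alignment.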
22
FASTA in 4 Steps Lipman & Pearson, Science 227:1435-1441, 1985
Find regions of seqs w/ highest density of matches
  Exact matches of a given length (by default 2 for proteins, 6 for nucleic acids) are determined
  Regions w/ a high number of matches are selected
23
FASTA in 4 Steps Lipman & Pearson, Science 227:1435-1441, 1985
Rescan 10 regions w/ highest density of identities using BLOSUM50 matrix
  Trim ends of region to include only those residues contributing to the highest score
  Each region is a partial alignment w/o gaps
24
FASTA in 4 Steps Lipman & Pearson, Science 227:1435-1441, 1985
If there are several regions w/ scores greater than a cut-off value, try to join these regions
  A score for the joined initial regions is calculated given a penalty for each gap
25
FASTA in 4 Steps Lipman & Pearson, Science 227:1435-1441, 1985
Select seqs in db w/ highest score
  For each seq construct a Smith-Waterman optimal alignment considering only positions that lie in a band centred on the best initial region found
26
BLAST in 3 Steps Altschul et al, JMB 215:403-410, 1990
Word matching as in FASTA
Similarity matching of words (3 aa’s, 11 bases)
  Identical words are not needed
  If no words are similar, then no alignment
  Won’t find matches for very short sequences
MSP: highest-scoring pair of segments of identical length. A segment pair is locally maximal if it cannot be improved by extending or shortening the segments
Find alignments w/ optimal maximal segment pair (MSP) score
  Gaps not allowed
  Homologous seqs will contain an MSP w/ a high score; others will be filtered out
27
BLAST in 3 Steps Altschul et al, JMB 215:403-410, 1990
For the query, find the list of high scoring words of length w
Image credit: Barton
28
BLAST in 3 Steps Altschul et al, JMB 215:403-410, 1990
Compare word list to db & find exact matches
Image credit: Barton
29
BLAST in 3 Steps Altschul et al, JMB 215:403-410, 1990
For each word match, extend alignment in both directions to find alignments that score greater than a threshold s
Image credit: Barton
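The extension step can be sketched as an ungapped extension to the left and right of a word hit. The stopping rule used here, giving up once the running score falls more than `x_drop` below the best score seen, is an assumption for illustration, as are the `match=1` and `mismatch=-1` scores; the slide only says that extension continues in both directions and the result is compared against a threshold s.

```python
def extend_ungapped(query: str, subject: str, q_pos: int, s_pos: int, word_len: int,
                    match: int = 1, mismatch: int = -1, x_drop: int = 3):
    """Extend a word hit without gaps; return (score, query_start, query_end)."""
    def pair_score(i, j):
        return match if query[i] == subject[j] else mismatch

    def extend(i, j, step):
        # Walk outward from the hit, remembering the best-scoring extension length.
        score = best = best_len = length = 0
        while 0 <= i < len(query) and 0 <= j < len(subject):
            score += pair_score(i, j)
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # assumed stopping rule: score dropped too far below the best
            i += step
            j += step
        return best, best_len

    seed = sum(pair_score(q_pos + t, s_pos + t) for t in range(word_len))
    right, right_len = extend(q_pos + word_len, s_pos + word_len, +1)
    left, left_len = extend(q_pos - 1, s_pos - 1, -1)
    return seed + left + right, q_pos - left_len, q_pos + word_len + right_len


score, start, end = extend_ungapped("TTACGTACGAA", "GGACGTACGCC", 4, 4, 3)
print(score, start, end)
```

A caller would keep the segment only if the returned score exceeds the threshold s.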
30
Spaced Seeds
111010010100110111 is an example of a spaced seed model with
  11 required matches (weight = 11)
  7 “don’t care” positions
GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT…
|| ||||||||| ||||| || |||||   ||||||
GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT…
11111111111 is the BLAST seed model for comparing DNA seqs
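A seed model can be written as a string of 1s (required match) and 0s (don’t care) and slid along a pair of aligned sequences. The sketch below assumes the weight-11, length-18 spaced seed 111010010100110111 and the contiguous seed of eleven 1s, and applies both to the two DNA fragments shown on the slide.

```python
PH_SEED = "111010010100110111"   # assumed spaced seed: weight 11, 7 don't-care positions
BLAST_SEED = "1" * 11            # contiguous seed of weight 11


def seed_hits(a: str, b: str, seed: str) -> list:
    """Return offsets where a and b agree at every required ('1') seed position."""
    required = [k for k, c in enumerate(seed) if c == "1"]
    n = min(len(a), len(b))
    return [i for i in range(n - len(seed) + 1)
            if all(a[i + k] == b[i + k] for k in required)]


a = "GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT"
b = "GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT"
print("spaced seed hits:", seed_hits(a, b, PH_SEED))
print("contiguous seed hits:", seed_hits(a, b, BLAST_SEED))
```

On this pair the longest run of consecutive matches is 9, so the contiguous seed finds nothing, while the spaced seed fits once with its don’t-care positions covering the mismatches.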
31
Observations on Spaced Seeds
Seed models w/ different shapes can detect different homologies
  The 3rd base in a codon “wobbles”, so a seed like 110110110… should be more sensitive when matching coding regions
Some models detect more homologies
  More sensitive homology search
  PatternHunter I
Use >1 seed models to hit more homologies
  Approaching 100% sensitive homology search
  PatternHunter II
32
PatternHunter I Ma et al., Bioinformatics 18:440-445, 2002
BLAST’s seed usually uses more than one hit to detect one homology
  Wasteful
  TTGACCTCACC?
  |||||||||||?
  1/4 chance to have 2nd hit next to the 1st hit
Spaced seeds use fewer hits to detect one homology
  Efficient
  CAA?A??A?C??TA?TGG?
  |||?|??|?|??||?|||?
  1/4^6 chance to have 2nd hit next to the 1st hit
33
PatternHunter I Ma et al., Bioinformatics 18:440-445, 2002
Proposition. The expected number of hits of a weight-W length-M model within a length-L region of similarity p is (L – M + 1) * p^W
Proof. For any fixed position, the prob of a hit is p^W. There are L – M + 1 candidate positions. The proposition follows.
34
Implication
For L = 1017
  BLAST seed expects (1017 – 11 + 1) * p^11 = 1007 * p^11 hits
  But ~1/4 of these overlap each other. So likely to have only ~750 * p^11 distinct hits
  Our example spaced seed expects (1017 – 18 + 1) * p^11 = 1000 * p^11 hits
  But only 1/4^6 of these overlap each other. So likely to have ~1000 * p^11 distinct hits
Spaced seeds likely to be more sensitive & more efficient
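The arithmetic in the proposition is easy to check numerically. A small sketch, where `expected_hits` is a hypothetical helper name and the seed lengths 11 (contiguous) and 18 (spaced) are taken from the weight-11 models discussed in these slides:

```python
def expected_hits(L: int, M: int, W: int, p: float) -> float:
    """Expected hits of a weight-W, length-M seed in a length-L region of similarity p."""
    return (L - M + 1) * p ** W


L = 1017
for p in (0.5, 0.7, 0.9):
    contiguous = expected_hits(L, 11, 11, p)   # 1007 * p^11
    spaced = expected_hits(L, 18, 11, p)       # 1000 * p^11
    print(f"p={p}: contiguous {contiguous:.3f}, spaced {spaced:.3f}")
```

The raw expectations are nearly equal; the difference argued on the slide comes from how many of those hits overlap one another.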
35
Sensitivity of PatternHunter I
Image credit: Li
36
Speed of PatternHunter I
Mouse Genome Consortium used PatternHunter to compare mouse genome & human genome
PatternHunter did the job in 20 CPU-days; it would have taken BLAST 20 CPU-years!
Nature, 420: , 2002
37
How to Increase Sensitivity?
Ways to increase sensitivity:
  “Optimal” seed
  Reduce weight by 1
  Increase number of spaced seeds by 1
For L = 1017 & p = 50%
  1 weight-11 length-18 model expects 1000/2^11 hits
  2 weight-12 length-18 models expect 2 * 1000/2^12 = 1000/2^11 hits
When comparing regions w/ >50% similarity, using 2 weight-12 spaced seeds together is more sensitive than using 1 weight-11 spaced seed!
38
PatternHunter II Li et al, GIW, 164-175, 2003
Idea
  Select a group of spaced seed models
  For each hit of each model, conduct extension to find a homology
Selecting optimal multiple seeds is NP-hard
Algorithm to select multiple spaced seeds
  Let A be an empty set
  Let s be the seed such that A ⋃ {s} has the highest hit probability
  A = A ⋃ {s}
  Repeat until |A| = K
Computing hit probability of multiple seeds is NP-hard
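The greedy selection loop can be sketched as follows. Since computing the hit probability exactly is hard, this sketch substitutes an empirical estimate: the fraction of sample regions (lists of match/mismatch flags) hit by at least one chosen seed. That substitution, the function names, and the tie-breaking rule (first candidate wins) are all assumptions made for illustration.

```python
def hits(region, seed: str) -> bool:
    """True if the seed fits at some offset of a region of match (True) / mismatch flags."""
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(all(region[i + k] for k in required)
               for i in range(len(region) - len(seed) + 1))


def greedy_select(candidates, k: int, regions):
    """Greedily pick up to k seeds maximizing the fraction of sample regions hit."""
    chosen = []
    covered = [False] * len(regions)
    for _ in range(k):
        best_seed, best_cov = None, None
        for s in candidates:
            if s in chosen:
                continue
            cov = [c or hits(r, s) for c, r in zip(covered, regions)]
            if best_cov is None or sum(cov) > sum(best_cov):
                best_seed, best_cov = s, cov
        if best_seed is None:
            break  # ran out of candidates
        chosen.append(best_seed)
        covered = best_cov
    return chosen, sum(covered) / len(regions)


sample = [[c == "1" for c in r] for r in ("1101000", "0001011", "1111000", "0000000")]
print(greedy_select(["1111", "1101", "1011"], 2, sample))
```

In practice the sample regions would be drawn at random for a target similarity level, and the candidate pool would be far larger.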
39
Sensitivity of PatternHunter II
[Figure: sensitivity curves, with the “One weight-12”, “Two weight-12”, and “One weight-11” curves labelled]
Solid curves: Multiple (1, 2, 4, 8, 16) weight-12 spaced seeds
Dashed curves: Optimal spaced seeds with weight = 11, 10, 9, 8
“Doubling the seed number” gains better sensitivity than “decreasing the weight by 1”
Image credit: Ma
40
Expts on Real Data
30k mouse ESTs (25Mb) vs 4k human ESTs (3Mb)
  Downloaded from NCBI GenBank
  “Low complexity” regions filtered out
SSearch (Smith-Waterman method) finds “all” pairs of ESTs with significant local alignments
Check what percentage of these pairs can be “found” by BLAST and different configurations of PatternHunter II
41
Results
In fact, at 80% similarity, 100% sensitivity can be achieved using 40 weight-9 seeds
Image credit: Ma
42
Farewell to the Supercomputer Age of Sequence Comparison!
Image credit: Bioinformatics Solutions Inc
43
Application of Sequence Comparison: Guilt-by-Association
44
Emerging Patterns
An emerging pattern is a pattern that occurs significantly more frequently in one class of data compared to other classes of data
A lot of biological sequence analysis problems can be thought of as extracting emerging patterns from sequence comparison results
45
A protein is a ...
A protein is a large complex molecule made up of one or more chains of amino acids
Proteins perform a wide variety of activities in the cell
46
Function Assignment to Protein Sequence
SPSTNRKYPPLPVDKLEEEINRRMADDNKLFREEFNALPACPIQATCEAASKEENKEKNR
YVNILPYDHSRVHLTPVEGVPDSDYINASFINGYQEKNKFIAAQGPKEETVNDFWRMIWE
QNTATIVMVTNLKERKECKCAQYWPDQGCWTYGNVRVSVEDVTVLVDYTVRKFCIQQVGD
VTNRKPQRLITQFHFTSWPDFGVPFTPIGMLKFLKKVKACNPQYAGAIVVHCSAGVGRTG
TFVVIDAMLDMMHSERKVDVYGFVSRIRAQRCQMVQTDMQYVFIYQALLEHYLYGDTELE
VT
How do we attempt to assign a function to a new protein sequence?
47
Guilt-by-Association
Compare the target sequence T with sequences S1, …, Sn of known function in a database
Determine which ones amongst S1, …, Sn are the most likely homologs of T
Then assign to T the same function as these homologs
Finally, confirm with suitable wet experiments
48
Guilt-by-Association
Compare T with seqs of known function in a db
Assign to T same function as homologs
Confirm with suitable wet experiments
If the experiments fail, discard this function as a candidate
49
BLAST: How it works Altschul et al., JMB, 215:403--410, 1990
BLAST is one of the most popular tools for doing “guilt-by-association” sequence homology search
  Find from db seqs with short perfect matches to query seq (Exercise: Why do we need this step?)
  Find seqs with good flanking alignment
50
Homologs obtained by BLAST
Thus our example sequence could be a protein tyrosine phosphatase (PTP)
51
Example Alignment with PTP
52
Guilt-by-Association: Caveats
Ensure that the effect of database size has been accounted for
Ensure that the function of the homolog is not derived via invalid “transitive assignment”
Ensure that the target sequence has all the key features associated with the function, e.g., active site and/or domain
53
Interpretation of P-value
Seq. comparison progs, e.g. BLAST, often associate a P-value to each hit
P-value is interpreted as prob. that a random seq. has an equally good alignment
Suppose the P-value of an alignment is 10^-6
If database has 10^7 seqs, then you expect 10^7 * 10^-6 = 10 seqs in it that give an equally good alignment
Need to correct for database size if your seq. comparison prog does not do that!
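The database-size correction is a one-line calculation. The sketch below reproduces the slide’s example and, as an added illustration, also gives the chance of seeing at least one equally good random hit, under the assumption that the database sequences behave independently; both function names are hypothetical.

```python
import math


def expected_chance_hits(p_value: float, db_size: int) -> float:
    """Expected number of database seqs giving an equally good alignment by chance."""
    return p_value * db_size


def prob_at_least_one_chance_hit(p_value: float, db_size: int) -> float:
    """1 - (1 - p)^N, computed stably; assumes independent database sequences."""
    return -math.expm1(db_size * math.log1p(-p_value))


print(expected_chance_hits(1e-6, 10**7))
print(prob_at_least_one_chance_hit(1e-6, 10**7))
```

So a per-sequence P-value of 10^-6, which looks impressive in isolation, is unremarkable against a database of 10^7 sequences.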
54
Examples of Invalid Function Assignment: The IMP dehydrogenases (IMPDH)
A partial list of IMP dehydrogenase misnomers in complete genomes remaining in some public databases
55
IMPDH Domain Structure
Typical IMPDHs have 2 IMPDH domains that form the catalytic core and 2 CBS domains
A less common but functional IMPDH (E70218) lacks the CBS domains
Misnomers show similarity to the CBS domains
56
Invalid Transitive Assignment
[Figure: sequences A, B, and C; labels: “Root of invalid transitive assignment”, “Mis-assignment of function”, “No IMPDH domain”]
57
Emerging Pattern
[Figure: typical IMPDH vs functional IMPDH w/o CBS]
Most IMPDHs have 2 IMPDH and 2 CBS domains
Some IMPDHs (E70218) lack CBS domains
IMPDH domain is the emerging pattern
58
Application of Sequence Comparison: Active Site/Domain Discovery
59
Discover Active Site and/or Domain
How to discover the active site and/or domain of a function in the first place?
  Multiple alignment of homologous seqs
  Determine conserved positions
  Emerging patterns relative to background
  Candidate active sites and/or domains
Easier if sequences of distant homologs are used
60
Multiple Alignment of PTPs
Notice the PTPs agree with each other on some positions more than other positions
These positions are more impt wrt PTPs
Else they wouldn’t be conserved by evolution
They are candidate active sites
61
Guilt-by-Association: What if no homolog of known function is found?
genome phylogenetic profiles
ProtFun’s feature profiles
62
Phylogenetic Profiling Pellegrini et al., PNAS, 96:4285--4288, 1999
Genes (and hence proteins) with identical patterns of occurrence across phyla tend to function together
Even if no homolog with known function is available, it is still possible to infer function of a protein
63
Phylogenetic Profiling: How it Works
Image credit: Pellegrini
64
Phylogenetic Profiling: P-value
[Equation image not preserved; its terms were labelled:]
  No. of ways to distribute z co-occurrences over N lineages
  No. of ways to distribute the remaining x – z and y – z occurrences over the remaining N – z lineages
  No. of ways of distributing X and Y over N lineages without restriction
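One reading of the three counts above, offered here as an assumption since the equation itself is missing, is P(z | N, x, y) = C(N, z) · C(N–z, x–z) · C(N–x, y–z) / (C(N, x) · C(N, y)): the first two counts multiplied together and divided by the unrestricted count. This is the hypergeometric probability that two genes occurring in x and y of N lineages co-occur in exactly z of them by chance. The function names below are hypothetical.

```python
from math import comb


def cooccurrence_prob(N: int, x: int, y: int, z: int) -> float:
    """Chance that genes present in x and y of N lineages co-occur in exactly z."""
    if z < 0 or z > min(x, y):
        return 0.0
    w_z = comb(N, z)                                     # place the z co-occurrences
    w_rest = comb(N - z, x - z) * comb(N - x, y - z)     # place the remaining occurrences
    total = comb(N, x) * comb(N, y)                      # unrestricted placements
    return w_z * w_rest / total


def p_value(N: int, x: int, y: int, z: int) -> float:
    """Chance of observing z or more co-occurrences."""
    return sum(cooccurrence_prob(N, x, y, k) for k in range(z, min(x, y) + 1))


print(p_value(20, 8, 9, 8))
```

A small P-value means the two profiles overlap more than chance would explain, which is the evidence used to link the two genes.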
65
Phylogenetic Profiles: Evidence
Pellegrini et al., PNAS, 96: , 1999
[Table not preserved; columns: no. of protein pairs in this group that differ by < 3 “bits”, and no. of pairs in a random group of proteins that differ by < 3 “bits”]
Proteins grouped based on similar keywords in SWISS-PROT have more similar phylogenetic profiles
66
Phylogenetic Profiling: Evidence
Wu et al., Bioinformatics, 19: , 2003
[Figure: for KEGG and COG, fraction of gene pairs having hamming distance D that share a common pathway, plotted against hamming distance D]
hamming distance(X,Y) = #lineages X occurs + #lineages Y occurs – 2 * #lineages X, Y occur
Proteins having low hamming distance (thus highly similar phylogenetic profiles) tend to share common pathways
Exercise: Why do proteins having high hamming distance also have this behaviour?
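The hamming distance formula can be checked on two small profiles. The sketch assumes a profile is a list of 0/1 flags, one per lineage, which is an illustrative encoding.

```python
def hamming_distance(profile_x, profile_y) -> int:
    """#lineages X occurs + #lineages Y occurs - 2 * #lineages both occur."""
    n_x = sum(profile_x)
    n_y = sum(profile_y)
    n_xy = sum(a and b for a, b in zip(profile_x, profile_y))
    return n_x + n_y - 2 * n_xy


x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]
print(hamming_distance(x, y))
```

The result equals the number of lineages in which exactly one of the two genes occurs.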
67
The ProtFun Approach Jensen, JMB, 319:1257--1265, 2002
A protein is not alone when performing its biological function
It operates using the same cellular machinery for modification and sorting as all other proteins do, such as glycosylation, phosphorylation, signal peptide cleavage, …
These have associated consensus motifs, patterns, etc.
Proteins performing similar functions should share some such “features”
Perhaps we can predict protein function by comparing its “feature” profile with other proteins?
68
ProtFun: How it Works
Extract feature profile of protein using various prediction methods
Average the output of the 5 component ANNs
69
ProtFun: Evidence
Some combinations of “features” seem to characterize some functional categories
70
ProtFun: Example Output
At the seq level, Prion, A4, & TTHY are dissimilar
ProtFun predicts them to be cell envelope-related, transport & binding
This is in agreement with known functionality of these proteins
71
ProtFun: Performance
72
Application of Sequence Comparison: Key Mutation Site Discovery
73
Identifying Key Mutation Sites
K.L. Lim et al., JBC, 273: , 1998
[Figure: sequence from a typical PTP domain D2]
Some PTPs have 2 PTP domains
PTP domain D1 has much more activity than PTP domain D2
Why? And how do you figure that out?
74
Emerging Patterns of PTP D1 vs D2
Collect example PTP D1 sequences
Collect example PTP D2 sequences
Make multiple alignment A1 of PTP D1
Make multiple alignment A2 of PTP D2
Are there positions conserved in A1 that are violated in A2?
These are candidate mutations that cause PTP activity to weaken
Confirm by wet experiments
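The comparison of A1 against A2 can be sketched as a column scan. The sketch assumes both alignments are lists of equal-length rows sharing one column coordinate system, treats “conserved” as every A1 row having the same residue, and treats “violated” as no A2 row having that residue; these strict definitions, the function names, and the toy rows are assumptions for illustration.

```python
def conserved_residue(column):
    """Return the residue if every row agrees on it (and it is not a gap), else None."""
    return column[0] if len(set(column)) == 1 and column[0] != "-" else None


def candidate_sites(a1, a2):
    """Positions conserved in alignment a1 whose residue never appears there in a2."""
    sites = []
    for pos in range(len(a1[0])):
        residue = conserved_residue([row[pos] for row in a1])
        if residue is not None and all(row[pos] != residue for row in a2):
            sites.append((pos, residue))
    return sites


d1_rows = ["ACDE", "ACDF", "ACDE"]   # toy stand-ins for aligned D1 sequences
d2_rows = ["ACEE", "AGEF"]           # toy stand-ins for aligned D2 sequences
print(candidate_sites(d1_rows, d2_rows))
```

Real alignments would call for a softer notion of conservation, for example a frequency threshold rather than unanimity.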
75
Emerging Patterns of PTP D1 vs D2
[Figure: sites marked present or absent in D1 and D2]
This site is consistently conserved in D1, but is consistently missing in D2
  It is an EP
  Possible cause of D2’s loss of function
This site is consistently conserved in D1, but is not consistently missing in D2
  It is not an EP
  Not a likely cause of D2’s loss of function
76
Key Mutation Site: PTP D1 vs D2
Positions marked by “!” and “?” are likely places responsible for reduced PTP activity
  All PTP D1 agree on them
  All PTP D2 disagree on them
77
Key Mutation Site: PTP D1 vs D2
Positions marked by “!” are even more likely as 3D modeling predicts they induce large distortion to structure
Image credit: Kolatkar
78
Confirmation by Mutagenesis Expt
What wet experiments are needed to confirm the prediction?
  Mutate E → D in D2 and see if there is gain in PTP activity
  Mutate D → E in D1 and see if there is loss in PTP activity
Exercise: Why do you need this 2-way expt?
79
Application of Sequence Comparison: From Looking for Similarities To Looking for Differences
80
Single Nucleotide Polymorphism
SNP occurs when a single nucleotide replaces one of the other three nucleotide letters
E.g., the alteration of the DNA segment AAGGTTA to ATGGTTA
SNPs occur in human population > 1% of the time
Most SNPs are found outside of “coding seqs” (Exercise: Why?)
SNPs found in a coding seq are of great interest as they are more likely to alter function of a protein
81
Example SNP Report
82
SNP Uses
Association studies
  Analyze DNA of group affected by disease for their SNP patterns
  Compare to patterns obtained from group unaffected by disease
  Detect diff betw SNP patterns of the two
  Find pattern most likely associated with disease-causing gene
[Figure: SNP patterns in normal vs disease groups, showing strong and weak association]
83
SNP Uses
Better evaluate role of non-genetic factors (e.g., behavior, diet, lifestyle)
Determine why people differ in abilities to absorb or clear a drug
Determine why an individual experiences side effect of a drug
Exercise: What is the general procedure for using SNPs in these 3 types of analysis?
84
Application of Sequence Comparison: The 7 Daughters of Eve
85
Population Tree
Estimate order in which “populations” evolved
Based on assimilated freq of many different genes
But …
  Is human evolution a succession of population fissions?
  Is there such thing as a proto-Anglo-Italian population which split, never to meet again, and became inhabitants of England and Italy?
[Figure: population tree by time since split, from the root to the leaves Australian, Papuan, Polynesian, Indonesian, Cherokee, Navajo, Japanese, Tibetan, English, Italian, Ethiopian, Mbuti Pygmy; regions labelled Africa, Europe, Asia, America, Oceania, Australasia]
86
Evolution Tree
Leaves and nodes are individual persons: real people, not a hypothetical concept like “proto-population”
Lines drawn to reflect genetic differences between them in one special gene called mitochondrial DNA
[Figure: tree from the root to African, Asian, Papuan, and European leaves; time axis from 150000 years ago through 100000 and 50000 to present]
87
Why Mitochondrial DNA
Present in abundance in bone fossils
Inherited only from mother
Sufficient to look at the 500bp control region
Accumulate more neutral mutations than nuclear DNA
Accumulate mutations at the “right” rate, about 1 every 10,000 years
No recombination, not shuffled at each generation
88
Mutation Rates
All pet golden hamsters in the world descend from a single female caught in 1930 in Syria
Golden hamsters “manage” ~4 generations a year :-)
So >250 hamster generations since 1930
Mitochondrial control regions of 35 (independent) golden hamsters were sequenced and compared
No mutation was found
Mitochondrial control region mutates at the “right” rate
89
Contamination
Need to know if DNA extracted from old bones is really from those bones, and not contaminated with modern human DNA
Apply same procedure to old bones from animals, check if you see modern human DNA. If none, then procedure is OK
90
Origin of Polynesians
Do they come from Asia or America?
[Map labels: 189, 217, 247, 261; 189, 217; 189, 217, 261]
Image credit: Sykes
91
Origin of Polynesians
Common mitochondrial control seqs from Rarotonga have variants at positions 189, 217, 247, 261. Less common ones have 189, 217, 261
Seqs from Taiwan natives have variants 189, 217
Seqs from regions in betw have variants 189, 217, 261. More 189, 217 closer to Taiwan. More 189, 217, 261 closer to Rarotonga
247 not found in America
Polynesians came from Taiwan!
Taiwan seqs sometimes have extra mutations not found in other parts
These are mutations that happened since Polynesians left Taiwan!
92
Neanderthal vs Cro Magnon
Are Europeans descended purely from Cro Magnons? Pure Neanderthals? Or mixed?
[Figure: Cro Magnon and Neanderthal]
93
Neanderthal vs Cro Magnon
Based on palaeontology, Neanderthal & Cro Magnon last shared an ancestor 250k years ago
Mitochondrial control regions accumulate 1 mutation per 10k years
If Europeans have mixed ancestry, mitochondrial control regions betw 2 Europeans should have ~25 diff w/ high probability
The number of diff betw Welsh is ~3, & at most 8. When compared w/ other Europeans, 14 diff at most
Ancestor either 100% Neanderthal or 100% Cro Magnon
Mitochondrial control seq from Neanderthal have 26 diff from Europeans
Ancestor must be 100% Cro Magnon
94
Clan Mother
Clan mother is the most recent maternal ancestor common to all members
A woman w/ only sons can’t be clan mother: her mitochondrial DNA can’t be passed on
A woman can’t be clan mother if she has only 1 daughter: she is not the most recent maternal ancestor
Exercise: Which of the marked women [symbols not preserved] is the clan mother?
95
How many clans in Europe?
Cluster seq according to mutations
Each cluster thus represents a major clan
European seq cluster into 7 major clans
The 7 clusters age betw [values not preserved] years (length of time taken for all mutations in a cluster to arise from a single founder seq)
The founder seq carried by just 1 woman in each case: the clan mother
Note that the clan mother did not need to be alone. There could be other women, it was just that their descendants eventually died out
Exercise: How about clan father?
96
World Clans
97
More Advanced Sequence Comparison Methods
PHI-BLAST
Iterated BLAST
PSI-BLAST
SAM
98
PHI-BLAST: Pattern-Hit Initiated BLAST
Input
  Protein sequence and pattern of interest that it contains
Output
  Protein sequences containing the pattern and having good alignment surrounding the pattern
Impact
  Able to detect statistically significant similarity between homologous proteins that are not recognizably related using traditional one-pass methods
99
PHI-BLAST: How it works
Find from database all seqs containing given pattern
Find sequences with good flanking alignment
100
PHI-BLAST: IMPACT
101
ISS: Intermediate Sequence Search
Two homologous seqs, which have diverged beyond the point where their homology can be recognized by a simple direct comparison, can be related through a third sequence that is suitably intermediate between the two
High score betw A & C, and betw B & C, imply A & B are related even though their own match score is low
102
ISS: Search Procedure
Input seq A
BLAST against db (p-value @ 0.081)
Matched seqs M1, M2, ...
Keep regions in M1, M2, … that match A. Discard rest of M1, M2, ...
Matched regions R1, R2, ...
BLAST against db
Results H1, H2, ...
103
ISS: IMPACT
No obvious match between Amicyanin and Ascorbate Oxidase
104
ISS: IMPACT
Convincing homology via Plastocyanin
Previously only this part was matched
105
PSI-BLAST: Position-Specific Iterated BLAST
Given a query seq, initial set of homologs is collected from db using GAP-BLAST
Weighted multiple alignment is made from query seq and homologs scoring better than threshold
Position-specific score matrix is constructed from this alignment
Matrix is used to search db for new homologs
New homologs with good score are used to construct new position-specific score matrix
Iterate the search until no new homologs found, or until specified limit is reached
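The step that turns an alignment into a position-specific score matrix can be sketched as per-column log-odds scores. This is a deliberately simplified illustration and not PSI-BLAST’s actual procedure: it assumes an ungapped alignment, equal sequence weights, a uniform background over the 20 amino acids, and a flat pseudocount, all of which are assumptions made here for brevity.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount: float = 1.0):
    """One dict of log2-odds scores per alignment column (uniform background assumed)."""
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({a: math.log2(((counts[a] + pseudocount) / total) / background)
                     for a in ALPHABET})
    return pssm


def score(pssm, seq: str) -> float:
    """Sum of column scores for a sequence aligned position-by-position to the matrix."""
    return sum(column[a] for column, a in zip(pssm, seq))


pssm = build_pssm(["ACD", "ACD", "ACE"])
print(round(score(pssm, "ACD"), 3), round(score(pssm, "WWW"), 3))
```

In an iterated search, sequences scoring well against the matrix would be added to the alignment and the matrix rebuilt, until no new homologs appear or an iteration limit is reached.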
106
SAM-T98 HMM Method
Similar to PSI-BLAST, but uses HMM instead of position-specific score matrix
107
Comparisons
Iterated seq. comparisons vs pairwise seq. comparison
108
Suggested Readings
109
References
S.E. Brenner. “Errors in genome annotation”, TIG, 15: , 1999
T.F. Smith & X. Zhang. “The challenges of genome sequence annotation or `The devil is in the details’”, Nature Biotech, 15: , 1997
D. Devos & A. Valencia. “Intrinsic errors in genome annotation”, TIG, 17: , 2001
K.L. Lim et al. “Interconversion of kinetic identities of the tandem catalytic domains of receptor-like protein tyrosine phosphatase PTP-alpha by two point mutations is synergistic and substrate dependent”, JBC, 273: , 1998
110
References
J. Park et al. “Sequence comparisons using multiple sequences detect three times as many remote homologs as pairwise methods”, JMB, 284(4): , 1998
J. Park et al. “Intermediate sequences increase the detection of homology between sequences”, JMB, 273: , 1997
Z. Zhang et al. “Protein sequence similarity searches using patterns as seeds”, NAR, 26(17): , 1998
M.S. Gelfand et al. “Gene recognition via spliced sequence alignment”, PNAS, 93: , 1996
B. Ma et al. “PatternHunter: Faster and more sensitive homology search”, Bioinformatics, 18: , 2002
M. Li et al. “PatternHunter II: Highly sensitive and fast homology search”, GIW, , 2003
111
References
S.F. Altschul et al. “Basic local alignment search tool”, JMB, 215: , 1990
S.F. Altschul et al. “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs”, NAR, 25(17): , 1997
M. Pellegrini et al. “Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles”, PNAS, 96: , 1999
J. Wu et al. “Identification of functional links between genes using phylogenetic profiles”, Bioinformatics, 19: , 2003
L.J. Jensen et al. “Prediction of human protein function from post-translational modifications and localization features”, JMB, 319: , 2002
B. Sykes. “The seven daughters of Eve”, Corgi Books, 2002
112
References
P. Erdos & A. Renyi. “On a new law of large numbers”, J. Anal. Math., 22: , 1970
R. Arratia & M.S. Waterman. “Critical phenomena in sequence matching”, Ann. Prob., 13: , 1985
R. Arratia, P. Morris, & M.S. Waterman. “Stochastic scrabble: Large deviations for sequences with scores”, J. Appl. Prob., 25: , 1988
R. Arratia & L. Gordon. “Tutorial on large deviations for the binomial distribution”, Bull. Math. Biol., 51: , 1989
113
Additional materials for the lecture, if we have time...
114
Understanding P-value
115
What is P-value?
What does E-value mean?
Statistical notion of P-value
  Prob that a random seq gives an equally good alignment
How do we calculate it?
116
Hypothesis Testing
Null hypothesis H0
  A claim (about a probability distribution) that we are interested in rejecting or refuting
Alt hypothesis H1
  The contrasting hypothesis that must be true if H0 is rejected
Type I error
  H0 is wrongly rejected
Type II error
  H1 is wrongly rejected
Level of significance
  Probability of getting a type I error
[Figure: distribution with acceptance region and rejection region]
117
Description Level, aka P-value
Instead of fixing the significance level at a value α, we may be interested in computing the probability of getting a result as extreme as, or more extreme than, the observed result under H0
The description level of a test H0 is the smallest level of significance α at which the observed test result would be declared significant; that is, would be declared indicative of rejection of H0
118
P-value: Key Questions
Recall S(A) scores an alignment A
Let H(U,V) = max{ S(A) | A is an alignment of U and V }
Suppose the letters in U and V are iid. Can we calculate h = E(H(U,V))?
Furthermore, can we calculate P(H(U,V) > h + c)?
119
Alignment: Statistical Understanding
Ignoring indels for now, we can think of a good alignment as one that has a long contiguous stretch of matches
The matches are essentially a long run of “heads” in a series of coin tosses
So we think in terms of “a headrun of length t begins at position i”
120
E(H(U,V)): Erdos-Renyi Thm for Exact Match
121
E(H(U,V)): Arratia-Waterman Thm for Local Alignment
122
Alignment: Statistical Understanding
Recall we think in terms of “a headrun of length t begins at position i”
Caution: headruns occur in “clumps”
  If there is a headrun of length t at position i, then with high prob there is also a headrun of length t at position i+1
So, we count only 1st headrun in a clump; i.e., a headrun preceded by a tail
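The difference between counting every headrun and counting only the first headrun of each clump can be seen on a short toss sequence. In this sketch a toss sequence is a list of booleans (True for heads); treating a headrun at position 0 as the start of a clump is an assumption, since no tail can precede it.

```python
def count_all_headruns(tosses, t: int) -> int:
    """Number of positions i where tosses[i:i+t] are all heads (clumps counted repeatedly)."""
    return sum(all(tosses[i:i + t]) for i in range(len(tosses) - t + 1))


def count_declumped_headruns(tosses, t: int) -> int:
    """Count only headruns preceded by a tail (or starting at position 0)."""
    count = 0
    for i in range(len(tosses) - t + 1):
        starts_clump = i == 0 or not tosses[i - 1]
        if starts_clump and all(tosses[i:i + t]):
            count += 1
    return count


tosses = [c == "H" for c in "HHHHTHHTHHH"]
print(count_all_headruns(tosses, 3), count_declumped_headruns(tosses, 3))
```

The opening run of four heads contributes two overlapping headruns of length 3 but only one clump.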
123
Arratia-Gordon Thm on Large Deviations for binomials
124
E(H(U,V)): Exact Match, Accounting for Clumps
125
E(H(U,V)): Local Alignment, Accounting for Clumps
126
P(H(U,V) > E(H(U,V))): Approximate P-Value
Our E(Yn) from previous slides are in the right form :-)