Multiple Sequence Alignment (MSA) and Phylogeny. Clustal X.


1 Multiple Sequence Alignment (MSA) and Phylogeny

2 Clustal X

3 Input: multiple sequence Fasta file

>gi|21536452|ref|NP_002762.2| mesotrypsin preproprotein [Homo sapiens]
MNPFLILAFVGAAVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWVVSAAHCYKTRIQVRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAAGTECLISGWGNTLSFGADYPDELKCLDAPVLTQAECKASYPGKITNSMFCVGFLEGGKDSCQRDSGGPVVCNGQLQGVVSWGHGCAWKNRPGVYTKVYNYVDWIKDTIAANS
>gi|114051746|ref|NP_001040585.1| protease, serine, 2 [Macaca mulatta]
MNPLLILAFVGVAVAAPFDDDDKIVGGYTCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAAHCYKTRIQVRLGEHNIEVLEGTEQFINAAKIIRHPDYDRKTLNNDILLIKLSSPAVINARVSTISLPTAPPAAGAEALISGWGNTLSSGADYPDELQCLEAPVLSQAECEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVSNGQLQGIVSWGYGCAQKNRPGVYTKVYNYVDWIRDTIAANS
>gi|6755891|ref|NP_035775.1| mesotrypsin [Mus musculus]
MNALLILALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKTRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFNRKTLNNDIMLLKLSSPVTLNARVATVALPSSCAPAGTQCLISGWGNTLSFGVSEPDLLQCLDAPLLPQADCEASYPGKITGNMVCAGFLEGGKDSCQGDSGGPVVCNRELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
>gi|6981422|ref|NP_036861.1| protease, serine, 2 [Rattus norvegicus]
MRALLFLALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFDRKTLNNDIMLIKLSSPVKLNARVATVALPSSCAPAGTQCLISGWGNTLSSGVNEPDLLQCLDAPLLPQADCEASYPGKITDNMVCVGFLEGGKDSCQGDSGGPVVCNGELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
>gi|27819626|ref|NP_777115.1| pancreatic anionic trypsinogen [Bos taurus]
MHPLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCGGSLINDQWVVSAAHCYQYHIQVRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASGSTECL...
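As a rough illustration of the input format, the sketch below reads FASTA text into (header, sequence) pairs: each record starts with a ">" header line, and the lines that follow hold the sequence. The function name `parse_fasta` and the two toy records are made up for this example; they are not part of ClustalX.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; join them back together.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input (hypothetical records, not the trypsin sequences from the slide).
example = """>seq1 toy record
MNPFL
ILAF
>seq2 another toy record
MNPLLILAF
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```

A multiple-sequence file is simply several such records one after another, which is what ClustalX expects when loading sequences.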

4 One way to obtain a multiple sequence Fasta file

5

6 Input: multiple sequence Fasta file (same sequences as on slide 3)

7 Input: multiple sequence Fasta file (same sequences as on slide 3)

8 Step 1: Load the sequences

9 Sequences and conservation view

10 Step 2: Perform alignment

11 Sequences and conservation view

12

13 Step 3: Create tree

14 Step 4: NJPlot

15

16 The Newick tree format is used to represent trees as strings.
[Figure: a tree with leaves A, C, B and D]
In Newick format: ((A,C),(B,D));
Each pair of parentheses () encloses a clade in the tree, and the commas separate the members of that clade. The semicolon ";" is always the last character.
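To make the bracket rules concrete, here is a minimal sketch of a parser for topology-only Newick strings such as the one above. It assumes plain leaf names with no branch lengths, internal labels or quoting, all of which the full format allows; `parse_newick` and `leaves` are illustrative names, not part of ClustalX or NJPlot.

```python
def parse_newick(s):
    """Parse a topology-only Newick string into nested tuples of leaf names."""
    s = s.strip()
    if not s.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1  # consume '(' that opens a clade
            children = [parse_node()]
            while s[pos] == ",":
                pos += 1  # consume ',' between clade members
                children.append(parse_node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')' that closes the clade
            return tuple(children)
        # Otherwise read a leaf name up to the next structural character.
        start = pos
        while s[pos] not in ",();":
            pos += 1
        return s[start:pos]

    tree = parse_node()
    if s[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return tree


def leaves(tree):
    """List the leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


tree = parse_newick("((A,C),(B,D));")
print(tree)
print(leaves(tree))
```

Each tuple in the result corresponds to one pair of parentheses, so the nesting of tuples mirrors the nesting of clades.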

17 How robust is our tree?

18 How robust is our tree?
- We need some statistical way to estimate the confidence in the tree topology
- But we don’t know anything about the tree topology distribution or parameters
- The only data source we have is our data (MSA)
- So, we must rely on our own resources: “pull up by your own bootstraps”

19 Bootstrap (and jackknife)

20 Jackknife 1. We create n (typically 100-1000) new MSAs (pseudo-data sets) by randomly sampling half of the characters (random samples without replacement). We do not change the number of sequences, just the number of positions!

  POS: 52316
  1  : TATTT
  2  : CATTT
  3  : CACTT
  N  : AACTT

  POS: 18745
  1  : TTTAT
  2  : TAACC
  3  : TAACC
  N  : TGGGA

  POS: 18394
  1  : TTGTA
  2  : TAGAC
  3  : TAAAC
  N  : TGAGG
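The resampling in step 1 can be sketched as follows. This assumes the MSA is held as a list of equal-length strings, one per sequence; the toy alignment, the seed and the name `jackknife_sample` are invented for illustration.

```python
import random


def jackknife_sample(msa, rng):
    """Keep a random half of the alignment columns, sampled without replacement."""
    k = len(msa[0])  # number of positions (columns) in the alignment
    cols = rng.sample(range(k), k // 2)  # distinct column indices
    # Every sequence is kept; only the set of positions shrinks.
    return cols, ["".join(seq[c] for c in cols) for seq in msa]


# Toy alignment (hypothetical data): 4 sequences, 10 positions.
msa = ["ATCTGATTCA", "ATCTGACTCC", "ACTTAACTGC", "ACCTAGCTGT"]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# n pseudo-data sets; the slide suggests n is typically 100-1000.
pseudo_sets = [jackknife_sample(msa, rng) for _ in range(100)]
cols, rows = pseudo_sets[0]
print("POS:", cols)
for row in rows:
    print(row)
```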

21 Jackknife 2. We reconstruct a tree from each data set, using the same method used for reconstructing the original tree.

  POS: 52316
  1  : TATTT
  2  : CATTT
  3  : CACTT
  N  : AACTT

  POS: 18745
  1  : TTTAT
  2  : TAACC
  3  : TAACC
  N  : TGGGA

  POS: 18394
  1  : TTGTA
  2  : TAGAC
  3  : TAAAC
  N  : TGAGG

[Figure: one tree over Sp1, Sp2, Sp3 and Sp4 for each of the three data sets]

22 Jackknife 3. For each node in our original tree, we count the number of times it appeared in the jackknife analysis.
[Figure: three jackknife trees over Sp1, Sp2, Sp3 and Sp4, and the original tree with its nodes labelled 67% and 100%]
In 67% of the data sets, the node Sp1+Sp2 was found.

23 Bootstrap The same as jackknife, but instead of sampling K/2 positions, we sample K positions with replacement

24 Bootstrap 1. Resample K positions n times

      12345 K
  1 : ATCTG…A
  2 : ATCTG…C
  3 : ACTTA…C
  N : ACCTA…T

      11244 K
  1 : AATTT…T
  2 : AATTT…G
  3 : AACTT…T
  N : AACTT…T

      47789…K
  1 : TTTAT…T
  2 : TAACC…G
  3 : TAACC…T
  N : TGGGA…T

      15578…K
  1 : AGGTA…T
  2 : AGGAC…G
  3 : AAAAC…A
  N : AAAGG…C
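A minimal sketch of the bootstrap resampling step, under the same assumption that the MSA is a list of equal-length strings. The only change from the jackknife is that K column indices are drawn with replacement, so some positions repeat and others drop out. The toy alignment, the seed and the name `bootstrap_sample` are invented for illustration.

```python
import random


def bootstrap_sample(msa, rng):
    """Draw K alignment columns with replacement, where K is the alignment length."""
    k = len(msa[0])
    cols = [rng.randrange(k) for _ in range(k)]  # indices may repeat
    return cols, ["".join(seq[c] for c in cols) for seq in msa]


# Toy alignment (hypothetical data): 4 sequences, 10 positions.
msa = ["ATCTGATTCA", "ATCTGACTCC", "ACTTAACTGC", "ACCTAGCTGT"]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# n bootstrap replicates of the alignment.
replicates = [bootstrap_sample(msa, rng) for _ in range(100)]
cols, rows = replicates[0]
print("POS:", cols)
for row in rows:
    print(row)
```

Each replicate has the same dimensions as the original alignment, so the same tree-building method can be run on it unchanged.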

25 Bootstrap 2. Reconstruct a tree from each data set using the same method used for reconstructing the original tree

      11244 K
  1 : AATTT…T
  2 : AATTT…G
  3 : AACTT…T
  N : AACTT…T

      47789…K
  1 : TTTAT…T
  2 : TAACC…G
  3 : TAACC…T
  N : TGGGA…T

      15578…K
  1 : AGGTA…T
  2 : AGGAC…G
  3 : AAAAC…A
  N : AAAGG…C

[Figure: one tree over Sp1, Sp2, Sp3 and Sp4 for each of the three data sets]

26 Bootstrap 3. For each node in our original tree, we count the number of times it appeared in the bootstrap analysis.
[Figure: three bootstrap trees over Sp1, Sp2, Sp3 and Sp4, and the original tree with its nodes labelled 67% and 100%]
- The jackknife method is less general than bootstrap
- Jackknife explores the data differently
- Jackknife is easier to apply to complex sampling schemes
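The counting in step 3 can be sketched as below, for either bootstrap or jackknife replicates. It assumes trees are rooted and stored as nested tuples of leaf names, and it treats a node as "found" when a replicate tree contains a clade with exactly the same set of leaves. For unrooted trees such as neighbor-joining output, a real implementation would compare bipartitions instead. The function names and the three replicate trees are invented for illustration and are not the trees from the slide.

```python
def clades(tree):
    """Return every clade of a nested-tuple tree as a frozenset of leaf names."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])  # a leaf is not counted as a clade
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    walk(tree)
    return found


def support(original, replicates):
    """Percentage of replicate trees that contain each clade of the original tree."""
    replicate_clades = [clades(t) for t in replicates]
    return {
        clade: 100.0 * sum(clade in rc for rc in replicate_clades) / len(replicates)
        for clade in clades(original)
    }


# Toy trees (hypothetical): two replicates agree with the original, one does not.
original = ((("Sp1", "Sp2"), "Sp3"), "Sp4")
replicate_trees = [
    ((("Sp1", "Sp2"), "Sp3"), "Sp4"),
    ((("Sp2", "Sp1"), "Sp3"), "Sp4"),
    ((("Sp1", "Sp3"), "Sp2"), "Sp4"),
]
values = support(original, replicate_trees)
for clade, pct in sorted(values.items(), key=lambda item: len(item[0])):
    print(sorted(clade), round(pct))
```

Comparing sets of leaves rather than tuples means the order in which a clade's members are written does not affect the count.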

27 Step 3.5 - Bootstrap

28 Bootstrap values on NJPlot. Note: ClustalX saves trees as .ph files; trees with bootstrap values are saved as .phb files. You might have to reopen the tree…

