Taking the Bite (Byte?) Out of Phylogeny Jennifer Galovich Lucy Kluckhohn Jones Holly Pinkart.

Introduction The goal is to produce an exercise that will engage allied health students and –strengthen math skills and decrease math phobia –decrease molecular data phobia –increase bioinformatics literacy

Prerequisites The following will be presented to students prior to this project –Basic evolutionary concepts and use of 16S rRNA in determining relationships between prokaryotes –Introduction to Biology Workbench, BLAST and tree construction

Approach Use the theme of food poisoning to engage both nursing and nutrition student populations Utilize mathematics and bioinformatics tools

Approach Students will pick a week in which food poisoning is likely: Christmas, the 4th of July, Thanksgiving, etc. Students will –identify a source of food poisoning (e.g., Salmonella) and check the Morbidity and Mortality Weekly Report tables for the number of cases in a specific state or region –calculate the proportion of cases represented by that region –answer “Is this number of cases unusual based on the data presented for this time period? How can you tell?”

Approach Students will then address the questions –“Without culturing the organism, how might you track it in humans or in a food supply?” –“What relationships (if any) exist between various strains of this organism?” –“Can this type of data be used to find the original strain?”

Approach Students will –obtain sequence data from NCBI’s GenBank for the organism (or virus) of interest –BLAST the sequence to find organisms with related sequences –Collect 8-13 of the closest BLAST results to perform a global alignment, and construct a tree

Questions Students choose a time period (week) and search MMWR (Morbidity and Mortality Weekly Report) for the number of cases of a particular disease for that week. 1. Given the chosen disease, how many cases of the disease occurred in a particular state (or other locale) during the week?

More Questions about the Scene 2a. How many persons are involved? Is there an index case? 2b. What percent of the population has the disease? 3. What other question might you ask from these data? 4. What microbe causes the disease? What strain, if appropriate?

Now What? (Questions about the microbe) 5. If you want to determine the specific strain of the microbe, can you find the genetic sequence? 6. How has the strain evolved? 7. What is its phylogeny, and what are the closest neighbors?

And Then... (Questions to Investigate) 8a. Why is the answer to the previous question of interest to you if you are a nurse, a dietician, a parent, the mayor, the hospital director, the first responder, a restaurant owner, a cruise ship director, a public health inspector, or other interested person (you choose)? 8b. What other questions are of interest to you in this role?

Finding the Microbe Search MMWR Morbidity Tables http://www.cdc.gov/mmwr/distrnds.html

Choose a Week http://wonder.cdc.gov/mmwr/mmwrmorb.asp

Choose a Disease http://wonder.cdc.gov/mmwr/mmwr_reps.asp?mmwr_year=2006&mmwr_week=07&mmwr_table=2F

What Percent of the Residents are Sick? http://wonder.cdc.gov/mmwr/mmwr_reps.asp?mmwr_year=2006&mmwr_week=01&mmwr_table=2F
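The proportion and percent questions come down to one division each. A minimal sketch follows; the function names and all three counts are hypothetical placeholders, not MMWR data, and should be replaced with the numbers read from the table for the chosen week.

```python
def proportion(part, whole):
    """Return part / whole as a fraction (0.08 means 8 out of every 100)."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


def percent(part, whole):
    """Return part / whole expressed as a percentage."""
    return 100 * proportion(part, whole)


# Hypothetical counts for illustration only (NOT taken from MMWR).
state_cases = 12              # cases reported in the chosen state
region_cases = 150            # cases reported in the whole region
state_population = 5_000_000  # residents of the chosen state

print(f"State share of regional cases: {proportion(state_cases, region_cases):.2f}")
print(f"Percent of residents reported sick: {percent(state_cases, state_population):.5f}%")
```

Comparing the state's share of cases with its share of the region's population is one simple way to start on the "is this unusual?" question.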

Find a Microbe Use your text, class notes, or other resources to determine the causative agent of the disease you have chosen. Choose a microbe, then find its family tree. For the Salmonellosis example, we have chosen Salmonella enterica, a microbe with many variants, called serovars.

Basics of Tree Construction Preliminary Exercises Goal –Students will practice with small examples before trying to construct a tree –Students will learn phylogenetics notation and terminology (also see Glossary at end)

From Sequences to Pairwise Alignment The Needleman-Wunsch Method

We make a table of residue scores, S(i,j). The number S(i,j) is computed by comparing residue i in sequence (1) with residue j in sequence (2), using previously chosen values for matches and mismatches. Each alignment matrix entry, H(i,j), gives the score of the best alignment of the first i residues in sequence (1) with the first j residues of sequence (2). We have one row for each residue in sequence (2) and one column for each residue in sequence (1). To get started, we add a 0th row and a 0th column. The upper left corner is position (0,0). We set H(0,0) = 0. The rest of the values in the top row are (reading across) -g, -2g, -3g, etc., where g is the gap penalty. Similarly, the rest of the values in the leftmost column are (reading down) -g, -2g, -3g, etc. To compute the value of H(i+1,j+1) we first consider the values north, west and northwest. We then find
–S(i+1,j+1) + (the value immediately northwest)
–(the value just north) – g
–(the value just west) – g

Distance Matrix Then we choose the largest of these three numbers to be H(i+1,j+1) and draw an arrow from position (i+1,j+1) to the position that gave us the value of H(i+1,j+1). Example: Let match = 1, mismatch = -1 and g = 2. Consider the sequences (1) G A A T T C (2) G G A T
             G    A    A    T    T    C
        0   -2   -4   -6   -8  -10  -12
   G   -2    1
   G   -4
   A   -6
   T   -8

Try This Exercise (at home ok) a. Complete the table and then follow the arrows to determine the alignment: –A diagonal arrow corresponds to aligning the two letters. –A horizontal arrow corresponds to aligning a letter from (1) with a gap, since moving along a row uses up a column letter only. –A vertical arrow corresponds to aligning a letter from (2) with a gap. –(Note that if you have ties, you may have more than one arrow, and so more than one “best” alignment.) b. Redo this exercise with your own choice of match, mismatch and gap values. Experiment with these values to obtain alignments different from the ones you got in part (a).
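To check a hand-filled table, the fill rule and the arrow-following step can be sketched in a few lines of Python. This is an illustrative sketch rather than the code behind any particular tool: the function name is made up, and when there are ties it follows only one arrow (diagonal first, then north, then west), so it reports a single best alignment even when several exist.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=2):
    """Fill the table H and trace back one best global alignment.

    Columns follow seq1 and rows follow seq2, as on the slides, so
    H[j][i] scores the first i residues of seq1 against the first j of seq2.
    """
    cols, rows = len(seq1) + 1, len(seq2) + 1
    H = [[0] * cols for _ in range(rows)]
    for i in range(1, cols):          # top row: -g, -2g, -3g, ...
        H[0][i] = -gap * i
    for j in range(1, rows):          # leftmost column: -g, -2g, -3g, ...
        H[j][0] = -gap * j

    def score(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    for j in range(1, rows):
        for i in range(1, cols):
            H[j][i] = max(H[j - 1][i - 1] + score(i, j),  # northwest
                          H[j - 1][i] - gap,              # north
                          H[j][i - 1] - gap)              # west

    # Follow the arrows back from the bottom-right corner.
    top, bottom = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and H[j][i] == H[j - 1][i - 1] + score(i, j):
            top.append(seq1[i - 1]); bottom.append(seq2[j - 1])
            i -= 1; j -= 1
        elif j > 0 and H[j][i] == H[j - 1][i] - gap:
            top.append("-"); bottom.append(seq2[j - 1])   # vertical arrow
            j -= 1
        else:
            top.append(seq1[i - 1]); bottom.append("-")   # horizontal arrow
            i -= 1
    return H, "".join(reversed(top)), "".join(reversed(bottom))


H, aligned1, aligned2 = needleman_wunsch("GAATTC", "GGAT")
for row in H:
    print(" ".join(f"{value:4d}" for value in row))
print(aligned1)
print(aligned2)
```

Changing the `match`, `mismatch` and `gap` arguments is a quick way to try part (b) once the hand calculation is done.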

From Pairwise Alignment to Multiple Alignment Idea of global progressive alignment: Most alike sequences are aligned together in order of their similarity. A consensus is determined and then aligned to the next most similar sequence. The determination of “next most similar” is made using phylogenetic information (a guide tree).

From Alignment to Distance Matrix There are many different ways of computing the distance between pairs of sequences in a multiple alignment. Each uses different assumptions, which may or may not be reasonable for a given situation. For example, the simplest model, Jukes-Cantor, assumes that mutation occurs at a constant rate, and that each nucleotide is equally likely to mutate into any other nucleotide (at that rate). For protein sequences, the calculation is (even) more complicated.
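Under the Jukes-Cantor assumptions, the distance between two aligned nucleotide sequences is usually written d = -(3/4) ln(1 - 4p/3), where p is the fraction of sites at which they differ. The sketch below applies that formula; the function names are hypothetical, and skipping gapped columns is an assumption made here for simplicity, not a rule taken from the slides.

```python
import math


def p_distance(a, b):
    """Fraction of aligned, ungapped sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns containing a gap in either sequence are ignored.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion p of differing sites."""
    if p >= 0.75:
        raise ValueError("the formula is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("GAAT", "GGAT")
print(p, jukes_cantor(p))
```

The corrected distance is never smaller than p, because it tries to account for sites that changed more than once.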

From Distance Matrix to Tree Again, there are many different methods available. Biology Workbench uses ClustalW to construct multiple alignments. Clustal uses the neighbor-joining method to find the guide tree. The final tree produced by Workbench is a compilation of these guide trees.

Clustering Methods The UPGMA (Unweighted Pair Group Method with Arithmetic Mean) method
+ easy to describe; produces an ultrametric (and hence additive) tree
- assumptions (molecular clock; all species evolve at the same rate)
General idea:
Step 1. Find the two closest taxa.
Step 2. Treat the two closest as a new combined taxon, and make a new matrix, calculating distances from the combined taxon to the others using the average of all the pairwise distances involved.
Iterate these two steps until the tree is completed.

Construct the UPGMA tree for the following distance matrix:
        A    B    C    D
   A    0    9    7    5
   B    9    0    8   10
   C    7    8    0    8
   D    5   10    8    0
Observe: A and D are closest. Next, update the matrix:
         A/D     B     C
   A/D     0  19/2  15/2
   B             0     8
   C                   0
Now the A/D cluster and C are closest.
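The two UPGMA steps can be sketched as a short loop. This is a minimal illustration, not the algorithm as any particular package implements it: the function name is made up, the tree is reported only as a list of merges, and ties are broken by taking whichever closest pair is found first.

```python
from itertools import combinations


def upgma(names, matrix):
    """Return the merges as (cluster1, cluster2, average distance) tuples."""
    index = {name: k for k, name in enumerate(names)}

    def average(c1, c2):
        # Average of all pairwise distances between members of the two clusters.
        total = sum(matrix[index[a]][index[b]] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Step 1: find the two closest clusters.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((c1, c2, average(c1, c2)))
        # Step 2: replace them with the combined cluster.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


names = ["A", "B", "C", "D"]
matrix = [[0, 9, 7, 5],
          [9, 0, 8, 10],
          [7, 8, 0, 8],
          [5, 10, 8, 0]]
for c1, c2, d in upgma(names, matrix):
    print("join", "/".join(c1), "with", "/".join(c2), "at distance", d)
```

Averaging over the original pairwise distances, rather than over the previous round's averages, is what "the average of all the pairwise distances involved" asks for once a cluster holds more than two taxa.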

Exercises 1. Finish constructing this tree. 2. The tree is ultrametric, but the data are not. (Why not?) How would the data have to be changed in order that they be ultrametric? 3. The tree is additive. Are the data? Now, redo questions 1 – 3 in case the BD distance is 12 instead of 10.

Neighbor Joining (NJ) + additive (but not ultrametric); computationally efficient - unrooted. Prior knowledge is needed to decide how to root the tree. Note: the species which are closest according to the distance matrix need NOT be neighbors. That’s why we need a modified distance formula. Exercise: Draw a picture of a tree on four taxa that illustrates the problem described in the note above.

Constructing a Neighbor Joining Tree
Step 1: Find the two taxa which are closest using the modified distance formula below. Join them.
To find the modified distance from node i to node j:
–Let N be the number of taxa.
–Let R_i = sum of all the distances from node i to all others except node j, divided by N – 2.
–Let R_j = sum of all the distances from node j to all others except node i, divided by N – 2.
–Let D(i,j) = matrix distance.
–Calculate the modified distance, D*, from i to j as D*(i,j) = D(i,j) – R_i – R_j.
For example, using the distance matrix we used earlier, D*(A,B) = 9 – 6 – 9 = -6.

NJ (continued) Step 2: Suppose that nodes i and j give the smallest value of D*. Start the tree by joining those nodes to a new node. Call the new node (ij). We now have two fewer taxa and one more internal node, for a net of one less node than we started with. Step 3: Now, as in the UPGMA method, we make a new matrix showing the distances to all the nodes except i and j. Problem: the new internal node (ij) is not in the original matrix.
[Figure: nodes i and j joined to the new internal node (ij)]

This problem can be solved Step 4: To update the matrix, you will need to compute the distance from the new internal node (ij) to the remaining nodes. For each remaining node k, compute the new distance as D((ij),k) = ½ [D(i,k) + D(j,k) – D(i,j)]. Step 5: Apply steps 1 – 4 to the revised matrix.
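One round of steps 1 – 4 can be sketched as follows. The code uses the modified distance exactly as defined on these slides (R_i leaves out the distance to j); the function names and the dictionary-of-dictionaries layout are choices made for this illustration. Because the formula divides by N – 2, the sketch is meant for matrices with at least three nodes.

```python
from itertools import combinations


def modified_distance(D, i, j):
    """D*(i,j) = D(i,j) - R_i - R_j, with R_i and R_j as defined on the slide."""
    n = len(D)
    r_i = sum(D[i][k] for k in D if k not in (i, j)) / (n - 2)
    r_j = sum(D[j][k] for k in D if k not in (i, j)) / (n - 2)
    return D[i][j] - r_i - r_j


def join_step(D):
    """Join the pair with the smallest D* and return (pair, updated matrix)."""
    i, j = min(combinations(D, 2), key=lambda pair: modified_distance(D, *pair))
    new = f"({i}{j})"
    rest = [k for k in D if k not in (i, j)]
    updated = {a: {b: D[a][b] for b in rest} for a in rest}
    updated[new] = {new: 0.0}
    for k in rest:
        # Step 4: distance from the new internal node (ij) to node k.
        d = 0.5 * (D[i][k] + D[j][k] - D[i][j])
        updated[new][k] = d
        updated[k][new] = d
    return (i, j), updated


names = ["A", "B", "C", "D"]
rows = [[0, 9, 7, 5],
        [9, 0, 8, 10],
        [7, 8, 0, 8],
        [5, 10, 8, 0]]
D = {a: {b: rows[r][c] for c, b in enumerate(names)} for r, a in enumerate(names)}

print(modified_distance(D, "A", "B"))
pair, D2 = join_step(D)
print(pair, D2)
```

Calling `join_step` again on the returned matrix corresponds to Step 5, as long as at least three nodes remain.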

Exercises Practice the NJ method on the matrix we had earlier. Now try both methods using the matrix below. Why do you get different trees?
        A    B    C    D
   A    0   17   21   27
   B   17    0   12   18
   C   21   12    0   14
   D   27   18   14    0

Final Approach Use the theme of food poisoning to engage both nursing and nutrition student populations Utilize mathematics and bioinformatics tools

Find the Microbial Gene NCBI Search http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide

Choose a Strain http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=nucleotide&cmd=search&term=Salmonella+enterica+16s+ribosomal+RNA+gene

BLAST Basic Local Alignment Search Tool http://www.ncbi.nlm.nih.gov/BLAST/

Paste Sequence, BLAST off! http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?CMD=Web&LAYOUT=TwoWindows&AUTO_FORMAT=Semiauto&ALIGNMENTS=50&ALIGNMENT_VIEW=Pairwise&CLIENT=web&DATABASE=nr&DESCRIPTIONS=100&ENTREZ_QUERY=%28none%29&EXPECT=10&FILTER=L&FORMAT_OBJECT=Alignment&FORMAT_TYPE=HTML&NCBI_GI=on&PAGE=Nucleotides&PROGRAM=blastn&SERVICE=plain&SET_DEFAULTS.x=34&SET_DEFAULTS.y=8&SHOW_OVERVIEW=on&END_OF_HTTPGET=Yes&SHOW_LINKOUT=yes&GET_SEQUENCE=yes

BLAST Results

BLAST Sequences http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi

GenBank http://www.ncbi.nlm.nih.gov/entrez/viewr.fcgi?db=nucleotide&val=88604678

FASTA http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&qty=1&c_start=1&list_uids=88604678&dopt=fasta&dispmax=5&sendto=&from=begin&to=end&extrafeatpresent=1&ef_CDD=8&ef_MGC=16&ef_HPRD=32&ef_STS=64&ef_tRNA=128

Constructing a Tree Add sequences http://seqtool.sdsc.edu/CGI/BW.cgi#!

Clustal W Choose the Multiple Sequence Alignment http://seqtool.sdsc.edu/CGI/BW.cgi#!

Choose a Tree Type Choose Rooted and/or Unrooted Submit http://seqtool.sdsc.edu/CGI/BW.cgi#!

Voila! Unrooted Tree http://seqtool.sdsc.edu/CGI/BW.cgi#!

Rooted Tree Which species are the most closely related? http://seqtool.sdsc.edu/CGI/BW.cgi#!

Final Questions How are the data helpful if you are a –Parent? –Restaurant owner? –Hospital director? –Public health inspector?

Assessment Student Learning Outcomes –More comfortable with computation –Using the tools to answer questions –Empowerment (we hope!)

References -- Texts
Emphasis on algorithms:
–Neil C. Jones and Pavel A. Pevzner, An Introduction to Bioinformatics Algorithms
–Michael S. Waterman, Introduction to Computational Biology
Bio/Math Balanced:
–Paul G. Higgs and Teresa K. Attwood, Bioinformatics and Molecular Evolution
The Bible of Phylogenetics:
–Joseph Felsenstein, Inferring Phylogenies

References -- Websites
–http://mbi.ohio-state.edu/2005/tutorials2005.html (tutorial on tree construction)
–http://bioalgorithms.info/courses.php (list of links to bioinformatics course websites)
–http://tree-thinking.org/ (resources for learning and teaching)

Glossary (for the faint of heart) Taxon (plural taxa) or operational taxonomic unit (OTU) – an entity (such as a species, protein sequence, language, etc.) whose distance from or similarity to other entities can be measured. Phylogeny – the evolutionary history of some collection of taxa, i.e., tracking lineages as the taxa change through time. Phylogenetic tree – a graphic representation of a phylogeny.

More Glossary Matrix – a rectangular array of data Graph – a collection of nodes (aka vertices) (usually represented by dots) and edges (connected pairs of vertices, usually represented by line segments)
[Figure: an example of a graph]

Even More Glossary Connected graph – in a connected graph, it is always possible to get from any node to any other node by following the edges. In a graph that is not connected, at least one pair of nodes has no path of edges between them.
[Figure: an example of a graph that is not connected]

Glossary – are we there yet? Cycle – a graph has a cycle if you can start at some node and, following the edges, get back to that node without backtracking.
[Figure: a graph with a cycle marked in red]

Glossary – almost done Tree – a connected graph with no cycles Weighted tree – a tree whose edges are labelled to represent distances Additive tree – a weighted tree in which the distance between any two nodes is the sum of the edge weights along the path joining them. In particular, whenever node B lies on the path from A to C, the distance from A to B plus the distance from B to C is the same as the distance from A to C. Degree of a node (or valence) – the number of edges attached to a node Rooted tree – a tree where some node has been specially designated. (Usually we interpret the root to be the ancestral taxon.)

The end of the Glossary Binary tree – if rooted the root has degree 2 and all others have degree 1 or 3. Internal nodes – nodes in a rooted tree of degree 3 Leaves – nodes in any tree of degree 1. Ultrametric tree – a tree is ultrametric if it meets the three point condition. Any three nodes determine three distances, AB, BC and AC. The three point condition says that the two largest of these three distances must be the same.
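The three point condition is easy to test mechanically on a distance matrix. The sketch below checks every triple of taxa; the function name and the small tolerance used for comparing distances are assumptions made for this illustration. The matrix is the one from the UPGMA exercise.

```python
from itertools import combinations


def is_ultrametric(names, matrix, tol=1e-9):
    """True if every triple of taxa meets the three point condition."""
    index = {name: k for k, name in enumerate(names)}
    for a, b, c in combinations(names, 3):
        ab = matrix[index[a]][index[b]]
        bc = matrix[index[b]][index[c]]
        ac = matrix[index[a]][index[c]]
        smallest, middle, largest = sorted([ab, bc, ac])
        # The two largest of the three distances must be the same.
        if abs(largest - middle) > tol:
            return False
    return True


names = ["A", "B", "C", "D"]
matrix = [[0, 9, 7, 5],
          [9, 0, 8, 10],
          [7, 8, 0, 8],
          [5, 10, 8, 0]]
print(is_ultrametric(names, matrix))
```

Printing the offending triple instead of returning at the first failure is a small change that helps with the "Why not?" part of the exercise.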
