Presenter: Yang Ruan Indiana University Bloomington

Presentation transcript:

Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions. Presenter: Yang Ruan, Indiana University Bloomington.

Outline: Motivation; Background; Spherical Phylogram Construction; Experiments; Conclusions and Future Work.

Motivation. Existing phylogenetic tree visualization methods are computationally slow and show the tree and the clustering results separately. We wanted to display the phylogenetic tree and the sequence clustering simultaneously. Question: how well do sequence clusters from a fast clustering algorithm match the phylogenetic tree for genetically diverse DNA sequences?

Background: Pairwise Sequence Alignment; Distance Calculation; Multidimensional Scaling; Interpolation; DACIDR; Traditional Phylogenetic Tree Construction.

Pairwise Sequence Alignment (PWA). Finds the overlapping region of two given sequences that has the highest similarity, as computed by a scoring measure. Global alignment: the overlap is defined over the entire length of the two sequences, e.g. Needleman-Wunsch (NW). Local alignment: the overlap is defined over a portion of the two sequences, e.g. Smith-Waterman-Gotoh (SWG). Each pairwise alignment is computed independently of all the others.
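The global alignment idea can be illustrated with a minimal Needleman-Wunsch sketch. The scoring values (match = 1, mismatch = -1, linear gap = -1) and the function name `needleman_wunsch` are assumptions chosen for illustration; they are not the parameters used in the talk, and SWG would additionally use affine gap penalties and local scoring.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for the pair a[i-1], b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Because every pair is aligned independently, calls such as `needleman_wunsch(s1, s2)` can be distributed across workers without any coordination, which is what makes the all-pairs step easy to parallelize.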

Distance Calculation. Align the sequences, then calculate a distance from the alignment, e.g. using Percentage Identity (PID):
Sequence A: ACATCCTTAACAA--ATTGC-ATC-AGT-CTA
Sequence B: ACATCCTTAGC--GAATT--TATGAT-CACCA
PID(A, B) = identical pairs / alignment length
Workflow: sequence (FASTA) file, then pairwise sequence alignment, then dissimilarity matrix.
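The PID formula can be written out as a short sketch. The function names are hypothetical, and converting PID to a dissimilarity as 1 - PID is an assumption made here for illustration; the slide only defines PID itself.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """PID = identical pairs / alignment length, for two aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as identical only if both characters match and are not gaps.
    identical = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    return identical / len(aligned_a)


def pid_distance(aligned_a, aligned_b):
    """Assumed dissimilarity: 1 - PID (0 for identical alignments)."""
    return 1.0 - percent_identity(aligned_a, aligned_b)


def dissimilarity_matrix(alignments):
    """Build a symmetric matrix from alignments[(i, j)] = (aligned_i, aligned_j).

    `alignments` is a hypothetical input: one pairwise alignment per pair i < j.
    """
    n = 1 + max(j for _, j in alignments)
    D = [[0.0] * n for _ in range(n)]
    for (i, j), (ai, aj) in alignments.items():
        D[i][j] = D[j][i] = pid_distance(ai, aj)
    return D
```

For example, `percent_identity("ACGT", "A-GT")` counts 3 identical columns out of 4.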

Multidimensional Scaling (MDS). A set of techniques that reduce the dimensionality of a dataset to a target dimension (usually 2 or 3). Scaling by Majorizing a Complicated Function (SMACOF): an EM-like algorithm that can become trapped in local optima; its weighting function requires inverting an order-N matrix. Weighted Deterministic Annealing SMACOF (WDA-SMACOF): uses deterministic annealing to avoid local optima and conjugate gradient to avoid the matrix inversion required by the weighting function.
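To make the SMACOF iteration concrete, here is a minimal sketch of the plain, unweighted algorithm (all weights equal to one), where each step applies the Guttman transform. It does not include the weighting, deterministic annealing, or conjugate gradient parts of WDA-SMACOF; the function names and the random initialization are assumptions for illustration.

```python
import math
import random


def pairwise_distances(X):
    """Euclidean distance matrix of a list of points."""
    n = len(X)
    return [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]


def stress(delta, X):
    """Raw stress: sum over i < j of (d_ij(X) - delta_ij)^2."""
    d = pairwise_distances(X)
    n = len(X)
    return sum((d[i][j] - delta[i][j]) ** 2 for i in range(n) for j in range(i + 1, n))


def smacof_step(delta, X):
    """One unweighted Guttman transform: X_new = (1/N) * B(X) * X."""
    n, dim = len(X), len(X[0])
    d = pairwise_distances(X)
    new_X = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or d[i][j] == 0.0:
                continue  # coincident points contribute nothing
            ratio = delta[i][j] / d[i][j]
            for c in range(dim):
                new_X[i][c] += ratio * (X[i][c] - X[j][c])
        for c in range(dim):
            new_X[i][c] /= n
    return new_X


def smacof(delta, dim=3, iterations=100, seed=0):
    """Run SMACOF from a random start; returns (coordinates, stress history)."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(len(delta))]
    history = [stress(delta, X)]
    for _ in range(iterations):
        X = smacof_step(delta, X)
        history.append(stress(delta, X))
    return X, history
```

The stress never increases from one step to the next, but the result depends on the starting configuration, which is the local-optimum issue that deterministic annealing is meant to address.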

Interpolation. MDS uses O(N^2) memory, which is a limitation for very large datasets. The data is therefore divided into two sets: an in-sample set for MDS and an out-of-sample set for interpolation. Majorizing Interpolative MDS (MI-MDS): an interpolation algorithm that assumes all weights equal one. Weighted Deterministic Annealing MI-MDS (WDA-MI-MDS): a robust interpolation algorithm that handles varying weights. Figure labels: in-sample points, out-of-sample points.
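A minimal sketch of majorization-based interpolation for a single out-of-sample point, assuming all weights equal one. The in-sample points stay fixed, and the new point is moved to reduce the squared mismatch between its mapped distances and its original dissimilarities. The function names are hypothetical, and this is the standard majorization update for that least-squares objective rather than a copy of the authors' implementation; in particular, it uses every in-sample point passed in instead of selecting nearest neighbors.

```python
import math


def interpolation_error(z, anchors, deltas):
    """Sum of squared differences between mapped and original distances."""
    return sum((math.dist(z, p) - d) ** 2 for p, d in zip(anchors, deltas))


def interpolate_point(anchors, deltas, iterations=100):
    """Place one out-of-sample point relative to fixed in-sample `anchors`.

    anchors: in-sample coordinates in the target space (fixed).
    deltas:  original dissimilarities from the new point to each anchor.
    Returns (position, error history).
    """
    k, dim = len(anchors), len(anchors[0])
    centroid = [sum(p[c] for p in anchors) / k for c in range(dim)]
    z = centroid[:]  # start at the centroid of the anchors
    history = [interpolation_error(z, anchors, deltas)]
    for _ in range(iterations):
        pull = [0.0] * dim
        for p, d in zip(anchors, deltas):
            dist = math.dist(z, p)
            if dist == 0.0:
                continue  # skip an anchor that coincides with the estimate
            for c in range(dim):
                pull[c] += d / dist * (z[c] - p[c])
        # Majorization update: centroid plus the averaged, rescaled offsets.
        z = [centroid[c] + pull[c] / k for c in range(dim)]
        history.append(interpolation_error(z, anchors, deltas))
    return z, history
```

Each out-of-sample point only needs its distances to the in-sample points, so memory grows with the size of the in-sample set rather than with N^2.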

DACIDR. Deterministic Annealing Clustering and Interpolative Dimension Reduction Method (DACIDR). Uses Hadoop for parallel applications, and Twister (Harp) for iterative MapReduce applications. Figure: simplified flow chart of DACIDR. Input: FASTA file (>G4P2R5E01A49DL GTCGTTTAAAGCC… >G4P2R5E01CT7SS … >G0H13NN01AMLS2). Boxes shown: Pairwise Clustering, All-Pair Sequence Alignment, Interpolation, Visualization, Multidimensional Scaling. Output: 3D result.

Traditional Phylogenetic Tree Construction. Multiple Sequence Alignment (MSA): used for three or more sequences and commonly used in phylogenetic analysis. All sequences have to be aligned with all other sequences in each iteration, so it has a higher computational cost than PWA. A popular tree construction tool, RAxML, reads the MSA result. It is a standard maximum likelihood method used to generate phylogenetic trees from an MSA.

Spherical Phylogram Construction: Traditional Phylogenetic Tree Display; Distance Calculation (Sum of Branches, Neighbor Joining); Interpolative Joining.

Phylogenetic Tree Display. Shows the inferred evolutionary relationships among various biological species using diagrams. 2D/3D displays include rectangular and circular phylograms. The display preserves the proximity of children to their parent. Figures: example of a 2D cladogram; examples of 2D phylograms.

Distance Calculation (1): Sum of Branches. The distance between points C and E can be calculated by summing branch(C, B), branch(B, A) and branch(A, E). Distance between leaf node C and E shown in (3) is clearly not equal to branch(B, C) + branch(B, D). The result will have a high bias because different distances were used for leaf nodes. Figure panels: (1) the cladogram of a tree with 5 nodes; (2) the leaf nodes of the tree in 2D space after dimension reduction; (3) the tree in 2D space after interpolation of the internal nodes.
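The sum-of-branches distance can be sketched as a path-length computation on the tree. The tree shape follows the slide's example (A is the parent of B and E, B is the parent of C and D), but the branch lengths below are invented for illustration, and `path_length` is a hypothetical helper name.

```python
def path_length(edges, start, goal):
    """Sum of branch lengths along the unique tree path from start to goal.

    edges: dict mapping (node_u, node_v) to the branch length between them.
    """
    # Build an undirected adjacency list from the edge dictionary.
    adjacency = {}
    for (u, v), length in edges.items():
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))

    # Depth-first search; in a tree, never stepping back to the parent is
    # enough to avoid revisiting nodes.
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, distance = stack.pop()
        if node == goal:
            return distance
        for neighbor, length in adjacency.get(node, []):
            if neighbor != parent:
                stack.append((neighbor, node, distance + length))
    raise ValueError("no path between %r and %r" % (start, goal))


# Illustrative branch lengths (assumed, not taken from the slides).
example_tree = {("A", "B"): 0.5, ("B", "C"): 0.25, ("B", "D"): 0.75, ("A", "E"): 1.0}
```

With these lengths, the C to E distance is branch(C, B) + branch(B, A) + branch(A, E) = 0.25 + 0.5 + 1.0.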

Distance Calculation (2): Neighbor Joining. Select a pair of existing nodes a and b and find a new node c. All other existing nodes are denoted k, and there are r existing nodes in total. The distances of the new node c are given by equations (1), (2) and (3) on the slide. The existing nodes are in-sample points in 3D and the new node is an out-of-sample point, so it can be interpolated into the 3D space.
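The slide's equations did not survive in the transcript. As a reference point, the textbook neighbor-joining step is sketched below using the slide's notation (a, b joined into c, other nodes k, r nodes in total); it is an assumption that the slide's equations (1) to (3) correspond to these standard formulas. The function name `nj_step` and the dict-of-dicts distance layout are hypothetical.

```python
from itertools import combinations


def nj_step(D):
    """One neighbor-joining step on a symmetric dict-of-dicts distance matrix.

    Returns ((a, b), d_ac, d_bc, d_ck) where c is the new parent of a and b:
      d(a, c) = d(a, b) / 2 + (S_a - S_b) / (2 (r - 2))
      d(b, c) = d(a, b) - d(a, c)
      d(c, k) = (d(a, k) + d(b, k) - d(a, b)) / 2
    with S_i the sum of distances from i to all other nodes. Requires r > 2.
    """
    nodes = sorted(D)
    r = len(nodes)
    if r < 3:
        raise ValueError("neighbor joining needs at least three nodes")
    total = {i: sum(D[i][k] for k in nodes if k != i) for i in nodes}

    def q(pair):
        # Standard selection criterion: the pair with the smallest Q is joined.
        i, j = pair
        return (r - 2) * D[i][j] - total[i] - total[j]

    a, b = min(combinations(nodes, 2), key=q)
    d_ac = 0.5 * D[a][b] + (total[a] - total[b]) / (2 * (r - 2))
    d_bc = D[a][b] - d_ac
    d_ck = {k: 0.5 * (D[a][k] + D[b][k] - D[a][b]) for k in nodes if k not in (a, b)}
    return (a, b), d_ac, d_bc, d_ck
```

The distances in `d_ck` are what an interpolation step would need in order to place the new node c among the already-mapped points.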

Interpolative Joining (Spherical Phylogram). For each pair of leaf nodes, compute the distances from their parent to them and the distances from their parent to all other existing nodes. Interpolate the parent into the 3D plot using those distances. Remove the two leaf nodes from the leaf node set and make the newly interpolated point an in-sample point. The tree is determined either by an existing tree (e.g. from RAxML) or by a generated tree (i.e. neighbor joining). Figure: spherical phylogram examples.

Experiments: Environment; Dataset; Construct Spherical Phylogram (Construct Phylogenetic Tree, Dimension Reduction using DACIDR, Visualization Result); MSA vs PWA; WDA-SMACOF vs Other MDS methods.

Environment. Running environment: Quarry Cluster at Indiana University; Xray Cluster of FutureGrid. Parallel runtimes: Hadoop, Twister, MPI. Applications: DACIDR, RAxML.

Dataset. DNA sequences from genetically diverse arbuscular mycorrhizal (AM) fungi were selected from three sources to include as much of the known genetic variation as possible: (a) sequences from the most comprehensive AM fungal phylogenetic tree to date (Kruger et al. 2011); (b) well-characterized GenBank sequences, added to expand the range of genetic variation; (c) representative sequences selected by clustering over 446k AM fungal sequences from spores using DACIDR. Two datasets with different trim lengths were used, 599nts and 999nts. The 599nts sequences are shorter than the 999nts sequences, and the 599nts dataset includes the representative sequences clustered with DACIDR. Figure labels: Start, 999 nts, 599 nts.

Construct Spherical Phylogram (1): Phylogenetic Tree Generation. MSA is done using MAFFT: the existing alignment from Kruger et al. is kept fixed, and the GenBank and DACIDR-clustered sequences are aligned to it. A maximum likelihood unrooted phylogenetic tree was then created with RAxML, using 100 iterations and the general time reversible (GTR) nucleotide substitution model with gamma rate heterogeneity (GTRGAMMA).

Construct Spherical Phylogram (2): MDS Visualization. Use simplified DACIDR to generate the plot in 3D. Distances are calculated from MSA, SWG and NW. Figure: MSA, SWG and NW each produce a dissimilarity matrix, which MDS turns into a 3D plot.

Construct Spherical Phylogram (3). Figures: RAxML result visualized in FigTree; spherical phylogram visualized in PlotViz.

Correlation of distance values between PWA and MSA. Distance values for MSA, SWG and NW used in DACIDR were compared to baseline RAxML pairwise distance values. Higher Mantel test correlations indicate a better match to the RAxML distances. All correlations were statistically significant (p < 0.001). Figure: Mantel comparison between the distances generated by the three sequence alignment methods and RAxML.
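For readers unfamiliar with the Mantel test, a minimal permutation-based sketch follows. It correlates the upper triangles of two distance matrices and estimates a one-sided p-value by jointly permuting the rows and columns of one matrix. The function names, the number of permutations, and the one-sided form are assumptions for illustration; the talk does not state which implementation or settings were used.

```python
import math
import random


def upper_triangle(M, order):
    """Upper-triangle entries of M after relabeling rows and columns by `order`."""
    n = len(order)
    return [M[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel(A, B, permutations=999, seed=0):
    """Mantel correlation between distance matrices A and B.

    Returns (r, p) where p is a one-sided permutation p-value.
    """
    n = len(A)
    identity = list(range(n))
    a = upper_triangle(A, identity)
    r_obs = pearson(a, upper_triangle(B, identity))
    rng = random.Random(seed)
    order = identity[:]
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)  # permute rows and columns of B together
        if pearson(a, upper_triangle(B, order)) >= r_obs - 1e-12:
            hits += 1
    # Add one to both counts so the observed matrix is included.
    return r_obs, (hits + 1) / (permutations + 1)
```

In the setting of this slide, `A` would hold the RAxML pairwise distances and `B` the distances from MSA, SWG or NW.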

MDS methods. The sum of branch lengths will be lower if a better dimension reduction method is used. WDA-SMACOF finds global optima. Figure: sum of branch lengths of the spherical phylogram (SP) generated in 3D space on the 599nts dataset optimized with 454 sequences and on the 999nts dataset.
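The sum of branch lengths of a spherical phylogram can be computed directly from the 3D coordinates of its nodes. The sketch below assumes each branch is drawn as a straight segment between the mapped positions of its two end nodes; the function name and the coordinates are hypothetical.

```python
import math


def edge_sum(coords, edges):
    """Total Euclidean length of all tree edges in the embedding.

    coords: dict mapping node name to its (x, y, z) position.
    edges:  iterable of (node_u, node_v) pairs, one per branch.
    """
    return sum(math.dist(coords[u], coords[v]) for u, v in edges)


# Illustrative coordinates (assumed values, not from the experiments).
example_coords = {"A": (0.0, 0.0, 0.0), "B": (3.0, 4.0, 0.0), "C": (3.0, 4.0, 12.0)}
example_edges = [("A", "B"), ("B", "C")]
```

Running `edge_sum` on the same tree embedded by two different MDS methods gives a single number per method, which is the comparison the slide describes.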

Conclusions and Future Work. Spherical phylograms give an efficient way of displaying the phylogenetic tree and the clustering result together. For sequence analysis where datasets are large, clustering could be used instead of phylogenetic analysis, since it is much faster yet still gives reliable results. Future improvements: (a) instead of displaying only the representative or consensus sequences from each cluster found in the original input dataset, display the tree with the entire dataset in 3D space with the help of Interpolative Joining (IJ); (b) improve the interpolation algorithm used in DACIDR to help identify sequences that are poorly defined; (c) determine the phylogenetic tree without RAxML, using a similar method on the distances generated after dimension reduction.

Questions? Yang Ruan (yangruan@indiana.edu) Geoffrey House (glhouse@indiana.edu) Geoffrey Fox (gcf@indiana.edu)

Whole pipeline

Why Local Optima Matter. Spherical phylograms built using different dimension reduction methods. Edge sum: the sum of the lengths of all edges. Local optima (examples): FR750020_Arc_Sch_K and FR750022_Arc_Sch_K. Figure: original distances from FR750020_Arc_Sch_K and FR750022_Arc_Sch_K to all other 832 points. Add new data.