Sequence Alignment and Phylogenetic Prediction using Map Reduce Programming Model in Hadoop DFS Presented by C. Geetha Jini (07MW03) D. Komagal Meenakshi.


Sequence Alignment and Phylogenetic Prediction using Map Reduce Programming Model in Hadoop DFS
Presented by: C. Geetha Jini (07MW03), D. Komagal Meenakshi (07MW05)
Guided by: Dr. G. Sudha Sadasivam, Asst. Professor, Dept. of CSE

What is Sequence Alignment? The procedure of comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.

Types of Sequence Alignment
• Pair-wise alignment: alignment of two sequences
  ◦ Global, using the Needleman-Wunsch algorithm:
      L G P S S K Q T G K G S _ S R A W D N
      |         |     | | |       |     |
      L N _ A T K S A G K G A I M R L G D A
  ◦ Local, using the Smith-Waterman algorithm:
      _ _ _ _ _ _ _ _ _ T G K G _ _ _ _ _ _ _ _ _ _
                          | | |
      _ _ _ _ _ _ _ _ _ A G K G _ _ _ _ _ _ _ _ _ _
• Multiple sequence alignment: alignment of more than two sequences

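Needleman-Wunsch Algorithm

Scoring: \( s(x_i, y_j) = +1 \) for a match and \( -1 \) for a mismatch; \( d \) is the gap penalty.

• Initialization
\[ F(0,0) = 0, \qquad F(i,0) = -i\,d, \qquad F(0,j) = -j\,d \]

• Main iteration: for each \( i = 1 \dots M \) and \( j = 1 \dots N \)
\[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) & \text{case 1} \\ F(i-1,j) - d & \text{case 2} \\ F(i,j-1) - d & \text{case 3} \end{cases} \]
\[ \mathrm{Ptr}(i,j) = \begin{cases} \text{DIAG} & \text{if case 1} \\ \text{UP} & \text{if case 2} \\ \text{LEFT} & \text{if case 3} \end{cases} \]

Case 1: \( x_i \) aligns to \( y_j \). Case 2: \( x_i \) aligns to a gap. Case 3: \( y_j \) aligns to a gap.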

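Needleman-Wunsch Algorithm: worked example

Sequences: x = ATA (index i) and y = AGTA (index j), with \( s(x_i, y_j) = +1 \) for a match, \( -1 \) for a mismatch, and \( d = 1 \).
[Figure: score matrix F(i,j) for ATA against AGTA]

\[ F(1,1) = \max \begin{cases} F(0,0) + s(x_1, y_1) = 1 \\ F(0,1) - 1 = -2 \\ F(1,0) - 1 = -2 \end{cases} = 1 \quad \text{(case 1)} \]
\[ F(1,2) = \max \begin{cases} F(0,1) + s(x_1, y_2) = -2 \\ F(0,2) - 1 = -3 \\ F(1,1) - 1 = 0 \end{cases} = 0 \quad \text{(case 3)} \]

PTR = DIAG if case 1, UP if case 2, LEFT if case 3.
Case 1: \( x_i \) aligns to \( y_j \). Case 2: \( x_i \) aligns to a gap. Case 3: \( y_j \) aligns to a gap.

Optimal alignment:
    A_TA
    AGTA
Score: 2 under the stated scoring (three matches at +1 each, one gap at -1).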
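The recurrence and pointer rules above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' Hadoop code: the function name `needleman_wunsch` and the tie-breaking order (DIAG before UP before LEFT) are assumptions made here, and the scoring defaults follow the slide (+1 match, -1 mismatch, d = 1).

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment of x (rows, index i) against y (columns, index j).

    Returns the score matrix F and the two aligned strings ('_' marks a gap).
    Ties are broken DIAG, then UP, then LEFT (an assumption of this sketch).
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]

    # Initialization: F(i,0) = -i*d and F(0,j) = -j*d
    for i in range(1, m + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, n + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"

    # Main iteration over the three cases
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1: x_i aligns to y_j
                (F[i - 1][j] - d, "UP"),        # case 2: x_i aligns to a gap
                (F[i][j - 1] - d, "LEFT"),      # case 3: y_j aligns to a gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Traceback from the bottom-right corner to (0, 0)
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))
```

Calling `needleman_wunsch("ATA", "AGTA")` reproduces the two cells worked out on the slide, F(1,1) = 1 and F(1,2) = 0, and the bottom-right cell of `F` holds the global alignment score.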

Smith-Waterman Algorithm

• Initialization
\[ F(0,j) = F(i,0) = 0 \]

• Iteration
\[ F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i, y_j) & \text{case 1} \\ F(i-1,j) - d & \text{case 2} \\ F(i,j-1) - d & \text{case 3} \end{cases} \]

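Smith-Waterman Algorithm: worked example

Sequences: x = ATA (index i) and y = AGTA (index j), with \( s(x_i, y_j) = +1 \) for a match, \( -1 \) for a mismatch, and \( d = 1 \).
[Figure: score matrix F(i,j) for ATA against AGTA]

\[ F(1,1) = \max \begin{cases} 0 \\ F(0,0) + s(x_1, y_1) = 1 \\ F(0,1) - 1 = -1 \\ F(1,0) - 1 = -1 \end{cases} = 1 \quad \text{(case 1)} \]
\[ F(1,3) = \max \begin{cases} 0 \\ F(0,2) + s(x_1, y_3) = -1 \\ F(0,3) - 1 = -1 \\ F(1,2) - 1 = -1 \end{cases} = 0 \]

PTR = DIAG if case 1, UP if case 2, LEFT if case 3.
Case 1: \( x_i \) aligns to \( y_j \). Case 2: \( x_i \) aligns to a gap. Case 3: \( y_j \) aligns to a gap.

Optimal local alignment:
    A_TA
    _ _TA
Score: 1 + 1 = 2 under the stated scoring (the two matched residues T and A).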
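A comparable sketch for the local case follows. It differs from the global version only in the zero floor, the zero-initialized borders, and a traceback that starts at the highest-scoring cell and stops at the first zero. The name `smith_waterman` and the choice to keep the first maximal cell encountered are assumptions of this illustration.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=1):
    """Local alignment of x (rows, index i) against y (columns, index j).

    Returns (best score, aligned piece of x, aligned piece of y, matrix F).
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]  # F(0,j) = F(i,0) = 0
    best, best_pos = 0, (0, 0)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                0,                    # start a fresh local alignment
                F[i - 1][j - 1] + s,  # case 1: x_i aligns to y_j
                F[i - 1][j] - d,      # case 2: x_i aligns to a gap
                F[i][j - 1] - d,      # case 3: y_j aligns to a gap
            )
            if F[i][j] > best:        # keep the first maximal cell
                best, best_pos = F[i][j], (i, j)

    # Traceback from the best cell until a zero cell is reached
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay)), F
```

For `smith_waterman("ATA", "AGTA")` the cells F(1,1) = 1 and F(1,3) = 0 match the slide's worked values, and the traceback returns the shared suffix of the two sequences.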

Proposed system (pairwise alignment of a query against a set of files)
• Input: one query file and a set of sequence files.
• Put all files in DFS.
• Map: set the file name as key and pass the entire file contents as value; do sequence alignment of the query file with the target files in DFS; return (file name as key, score as value).
• Reduce: combine all the (K,V) pairs.
• Output: (Filename, Score).
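The flow above can be imitated locally without a cluster. The sketch below is a plain-Python simulation of the map and reduce roles, not Hadoop API code: the names `map_phase`, `reduce_phase`, `run_job`, the in-memory `files` dictionary standing in for DFS, and the use of a global (Needleman-Wunsch) score are all assumptions for illustration.

```python
def global_score(x, y, match=1, mismatch=-1, d=1):
    """Needleman-Wunsch score only, keeping two rows of the matrix."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i, xi in enumerate(x, start=1):
        cur = [-i * d]
        for j, yj in enumerate(y, start=1):
            s = match if xi == yj else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def map_phase(filename, contents, query):
    """Key = file name, value = whole file; emit (file name, score)."""
    yield filename, global_score(query, contents.strip())


def reduce_phase(pairs):
    """Combine all (K, V) pairs; one score per file is expected here."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {key: max(values) for key, values in grouped.items()}


def run_job(query, files):
    """files: dict of file name -> contents, standing in for the DFS input."""
    emitted = []
    for name, contents in files.items():
        emitted.extend(map_phase(name, contents, query))
    return reduce_phase(emitted)
```

With hypothetical inputs such as `run_job("ATA", {"seq1.txt": "AGTA", "seq2.txt": "ATA"})`, the result is a dictionary of (Filename, Score) entries. In a real Hadoop job each map call would run on a separate node, which is where the speed-up comes from; the sequential loop here only shows the data flow.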

• A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA.
• In general, the input is a set of query sequences that are assumed to have an evolutionary relationship: they share a lineage and are descended from a common ancestor.
• From the resulting multiple sequence alignment, phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.

Methods for producing MSA
• Dynamic programming
• Progressive alignment construction

• Dynamic programming is the most direct method for producing an MSA; it identifies the globally optimal alignment solution.
• Computational complexity
  ◦ For n individual sequences, the naive method requires constructing the n-dimensional equivalent of the matrix formed in standard pairwise sequence alignment.
  ◦ The search space thus increases exponentially with increasing n and is also strongly dependent on sequence length.
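The growth described above can be made concrete with a short calculation. The symbols \( L_k \) (length of sequence k) and \( L \) (a common length) are introduced here for illustration and do not appear on the slide; the per-cell cost of scoring a column is ignored.

```latex
% Number of cells in the n-dimensional dynamic programming lattice
\[ \text{cells} = \prod_{k=1}^{n} (L_k + 1) \]

% Each cell is reached from up to 2^n - 1 predecessor cells
% (every non-empty subset of the n sequences may advance by one residue),
% so for equal lengths L_k = L the running time is
\[ T(n, L) = O\!\left( 2^{n} L^{n} \right) \]

% Example with L = 100:
%   n = 2  -> 101^2  ~ 1.0 x 10^4  cells
%   n = 3  -> 101^3  ~ 1.0 x 10^6  cells
%   n = 10 -> 101^10 ~ 1.1 x 10^20 cells
```

This is why exact dynamic programming is practical only for a handful of short sequences, and why heuristic methods are used for larger sets.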

• Progressive alignment uses a heuristic search.
• It builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related.
• The most popular progressive alignment method has been ClustalW.
• All progressive alignment methods require two stages:
  ◦ a first stage in which the relationships between the sequences are represented as a tree, called a guide tree;
  ◦ a second stage in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree.

◦ First step: computation of the guide tree from pairwise alignment scores by an efficient clustering method such as the neighbor-joining method.
◦ Second step: the two most similar sequences are aligned first; additional sequences (or groups of sequences) are added later, following the guide tree.
◦ This requires a method to optimally align a sequence with an alignment, or an alignment with an alignment.

[Figure: guide tree over sequence 1, sequence 2, sequence 3, sequence 4]

Example: according to the guide tree, first align sequences 1 and 2, then align sequence 3 to the alignment of sequences 1 and 2, then align sequence 4 to the alignment of sequences 1, 2, and 3.

• Neighbor-joining is a bottom-up clustering method used for the construction of phylogenetic trees.
• Neighbor-joining is an iterative algorithm. Each iteration consists of the following steps:
• Based on the current distance matrix, calculate the matrix Q.
• For example, if we have four taxa (A, B, C, D) and the following distance matrix:

[Table: distance matrix for taxa A, B, C, D]

• We obtain the following values for the Q matrix:

[Table: Q matrix for taxa A, B, C, D]

• Find the pair of taxa in Q with the lowest value. Create a node on the tree that joins these two taxa (i.e., join the closest neighbors, as the algorithm name implies).

• Calculate the distance of each of the taxa in the pair to this new node.
• Calculate the distance of all taxa outside of this pair to the new node.
• Start the algorithm again, considering the pair of joined neighbors as a single taxon and using the distances calculated in the previous step.
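One iteration of these steps can be sketched as follows, using the standard neighbor-joining formulas: Q(i,j) = (n - 2) d(i,j) - Σ d(i,k) - Σ d(j,k), branch length d(i,u) = d(i,j)/2 + (r_i - r_j) / (2(n - 2)), and d(u,k) = (d(i,k) + d(j,k) - d(i,j)) / 2. The slide's own distance matrix is not available in the transcript, so `EXAMPLE_D` below is an illustrative four-taxon matrix chosen for this sketch; the function names are likewise assumptions.

```python
# Illustrative distances for taxa A, B, C, D (not taken from the slide)
EXAMPLE_D = [
    [0, 7, 11, 14],
    [7, 0, 6, 9],
    [11, 6, 0, 7],
    [14, 9, 7, 0],
]


def q_matrix(D):
    """Q(i,j) = (n-2)*d(i,j) - sum_k d(i,k) - sum_k d(j,k); diagonal kept at 0."""
    n = len(D)
    r = [sum(row) for row in D]
    return [[0 if i == j else (n - 2) * D[i][j] - r[i] - r[j]
             for j in range(n)] for i in range(n)]


def nj_step(D, names):
    """One neighbor-joining iteration on an n x n distance matrix, n >= 3."""
    n = len(D)
    Q = q_matrix(D)
    r = [sum(row) for row in D]

    # Pair with the lowest Q value (first such pair on ties)
    pairs = [(a, b) for a in range(n) for b in range(a + 1, n)]
    i, j = min(pairs, key=lambda p: Q[p[0]][p[1]])

    # Distance from each member of the pair to the new node u
    dist_i = D[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    dist_j = D[i][j] - dist_i

    # Distance from every other taxon to the new node u
    rest = [k for k in range(n) if k not in (i, j)]
    to_new = [(D[i][k] + D[j][k] - D[i][j]) / 2 for k in rest]

    # Reduced matrix with the joined pair as a single taxon in row/column 0
    new_D = [[0] + to_new]
    for idx, k in enumerate(rest):
        new_D.append([to_new[idx]] + [D[k][m] for m in rest])
    new_names = ["(" + names[i] + "," + names[j] + ")"] + [names[k] for k in rest]
    return new_D, new_names, (dist_i, dist_j)
```

Calling `nj_step(EXAMPLE_D, ["A", "B", "C", "D"])` returns a 3 x 3 matrix in which the joined pair is treated as one taxon; repeating the call on that result continues the algorithm until the tree is complete.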

• The primary problem with progressive alignment is that errors made at any stage in growing the MSA are propagated through to the final result.
• Performance is also particularly bad when all of the sequences in the set are rather distantly related.

Phylogenetic Analysis
• An investigation of evolutionary relationships among a group of related sequences, carried out by producing a tree representation of those relationships.
• Significant use: making predictions concerning the tree of life.

Structure
• Outer branches -> sequences
• Inner part -> reflects the degree to which sequences are related
• Alike sequences -> located at neighboring outside branches
• Less related sequences -> more distant from each other

Proposed System
• Implementation of sequence alignment and phylogenetic prediction using the map-reduce programming model in Hadoop
• Algorithms used for alignment
  ◦ Global: Needleman-Wunsch algorithm
  ◦ Local: Smith-Waterman algorithm

Proposed system (all-pairs alignment for phylogenetic analysis)
• Input: a set of sequence files.
• Put all files in DFS.
• Map: set the file name as key and pass the entire file contents as value; do sequence alignment of all the files in all possible combinations and find the alignment scores; return (file name as key, score as value).
• Reduce: combine all the (K,V) pairs.
• Output: (Filename, Score).
• Phylogenetic analysis on the resulting scores.
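The all-combinations step can be sketched locally as below. This is a plain-Python stand-in for the map and reduce roles, not Hadoop code. The slide does not say how scores are keyed per pair or how they are turned into distances for tree building, so the pair-of-names key and the function names `pairwise_scores` and `global_score` are assumptions, and no score-to-distance conversion is attempted.

```python
from itertools import combinations


def global_score(x, y, match=1, mismatch=-1, d=1):
    """Needleman-Wunsch score only, keeping two rows of the matrix."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i, xi in enumerate(x, start=1):
        cur = [-i * d]
        for j, yj in enumerate(y, start=1):
            s = match if xi == yj else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def pairwise_scores(files):
    """files: dict of file name -> sequence text (stand-in for DFS input).

    "Map": one alignment per unordered pair of files.
    "Reduce": collect the emitted ((name_a, name_b), score) pairs in a dict.
    """
    emitted = []
    for name_a, name_b in combinations(sorted(files), 2):
        score = global_score(files[name_a].strip(), files[name_b].strip())
        emitted.append(((name_a, name_b), score))
    return dict(emitted)
```

For n files this produces n(n - 1)/2 independent alignments, which is the part of the workload that distributes naturally across map tasks. The resulting score table is the input a later tree-building step, such as neighbor-joining, would need once scores are converted to distances.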

• The MapReduce algorithm for pairwise sequence alignment, both local and global, was completed in Hadoop using the Needleman-Wunsch and Smith-Waterman algorithms.
• This can be extended to do multiple sequence alignment and to perform phylogenetic analysis in Hadoop, for predicting possible evolutionary relationships among a group of related sequences.


Thank you