PipeAlign: A new tool for protein sequence analysis

Protein sequence analysis is a key issue in post-genomic biology. High-throughput genome sequencing and assembly techniques, structural proteomics and gene expression analysis have led to a rapid increase in the amount of sequence, structure and functional data available in the public databases. To fully understand the biological role of a particular protein, diverse information such as cellular location, 2D/3D structures, mutations and their associated illnesses, the evolutionary context and literature references must be retrieved, validated, classified and made available to the biologist. Placing the protein in the context of its complete family is an essential first step in the analysis process. As a consequence, a new generation of protein family analysis tools is now required to organise this heterogeneous, often predicted data into a structured, hierarchical network of connected information.

PipeAlign is a protein family analysis tool integrating a five-step process ranging from the search for sequence homologues in protein and 3D structure databases to the definition of the hierarchical relationships within and between subfamilies. The complete, automatic pipeline takes a single sequence or a set of sequences as input and constructs a high-quality, validated MACS (multiple alignment of complete sequences) in which sequences are clustered into potential functional subgroups. Three main criteria were essential in the PipeAlign development process: accuracy, quality and efficiency. As a consequence, all the steps described below were extensively tested on large-scale datasets, such as genomes, a benchmark database or dedicated databases.

Ballast is a program designed to retrieve sequences very distantly related to the query sequence. In the BlastP results, such sequences are given a very high Expect value and are therefore ranked very low.
Ballast takes the reference sequence (query) and searches the protein sequence database (SwissProt + SpTrEMBL + PDB) for homologues using BlastP. It then uses the Blast results to create a conservation profile by stacking the Blast pairwise alignments (HSPs, high-scoring segment pairs). From this profile, Ballast extracts small conserved segments (LMSs) that may characterise the family of the query. These segments are used to rescore the Blast results.

DbClustal is a new version of the popular ClustalW program. It can incorporate information on local matching segment pairs retrieved from a database search when building a multiple sequence alignment. Such information overcomes the problem of long insertions/extensions, which often lead to wrong alignments with programs based on global alignment algorithms. The LMSs obtained from Ballast are used in DbClustal as anchors to seed the alignment process. Anchors are incorporated not as constraints but as restraints: an additional weight, depending on the frequency of the observed LMS, is added to the scoring scheme when aligning two sequences.

Figure 1.
a) The Ungapped Segment Pairs (USPs) taken from alignments with an Expect value below 0.1 are stacked and used to build a profile of amino-acid conservation along the query sequence. The contribution of each USP to the profile is proportional to the significance of its corresponding HSP.
b) LMSs can be considered as an informative signal within a noisy background. A first noise-reduction step smooths the raw conservation profile. Peaks are expected to indicate segments that are significantly more conserved than their background.
c) To identify peaks, the second derivative of the smoothed profile is computed. Informative peaks corresponding to LMSs were empirically determined to be those where the second derivative is less than -0.1. The LMSs are shown here as boxes over the second derivative plot.
d) For each database sequence returned by Blast, a dynamic programming algorithm determines the optimal alignment of the USPs to the predicted LMSs in the query sequence. The score for each USP depends on the strength of the LMS (the area of the raw profile overlapping the LMS) and on how well the USP matches the LMS. The optimal alignment score is used to rank the sequences. This operation may be done with user-selected LMSs instead of all predicted LMSs.

Figure 2. Flowchart showing the four major steps of the LEON algorithm. The input to the algorithm consists of a multiple sequence alignment in which the user has identified a reference or 'query' sequence. The final result is a multiple alignment in which the sequences predicted to be homologous are ranked according to their similarity to the query. Non-homologous sequences are excluded from the alignment.

Plewniak F, Bianchetti L, Brelivet Y, Carles A, Chalmel F, Lecompte O, Mochel T, Moulinier L, Muller A, Muller J, Prigent V, Ripp R, Thierry JC, Thompson JD, Wicker N, Poch O.

RASCAL then tries to correct potential errors remaining in the alignment. Building a multiple sequence alignment relies on a heuristic, and even if incorporating prior information in the first steps of the alignment process helps the progressive alignment algorithm, some initial errors may remain. To find badly aligned regions, the alignment is divided horizontally, by clustering the sequences into subfamilies using Secator, a program based on a phylogenetic guide tree, and vertically, using a sliding window in which an MD score is computed for the whole alignment and for each subfamily. This identifies global and family-specific 'core block' regions corresponding to well-aligned zones in the alignment. A refinement strategy is then used to realign the regions between core blocks, starting inside each family, then between pairs of families, and finally between the global core blocks.
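The LMS detection described in panels a) to c) of Figure 1 can be illustrated with a minimal Python sketch. It is not Ballast's implementation: the moving-average smoothing, the window size, the finite-difference second derivative, the (start, end, weight) representation of a USP and all function names are assumptions made for illustration. Only the idea of stacking weighted USPs and the -0.1 threshold on the second derivative come from the text, and that threshold is tied to the scale of Ballast's own profile, which this toy profile does not reproduce.

```python
from typing import List, Tuple

# A USP is modelled here as (start, end, weight): 0-based query coordinates,
# end exclusive; weight is a hypothetical stand-in for the HSP significance.
USP = Tuple[int, int, float]


def build_profile(query_len: int, usps: List[USP]) -> List[float]:
    """Stack USPs along the query: each one adds its weight to the residues it covers."""
    profile = [0.0] * query_len
    for start, end, weight in usps:
        for i in range(max(0, start), min(query_len, end)):
            profile[i] += weight
    return profile


def smooth(profile: List[float], half_window: int = 2) -> List[float]:
    """Moving average over 2 * half_window + 1 residues (assumed smoothing method)."""
    n = len(profile)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        smoothed.append(sum(profile[lo:hi]) / (hi - lo))
    return smoothed


def second_derivative(values: List[float]) -> List[float]:
    """Central finite difference; the two end points are left at 0."""
    d2 = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        d2[i] = values[i - 1] - 2 * values[i] + values[i + 1]
    return d2


def find_lms(d2: List[float], threshold: float = -0.1) -> List[Tuple[int, int]]:
    """Return maximal runs [start, end) where the second derivative is below threshold."""
    segments = []
    start = None
    for i, value in enumerate(d2):
        if value < threshold:
            if start is None:
                start = i
        elif start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(d2)))
    return segments


def detect_lms(query_len: int, usps: List[USP],
               half_window: int = 2, threshold: float = -0.1) -> List[Tuple[int, int]]:
    """Profile -> smoothing -> second derivative -> thresholding."""
    profile = build_profile(query_len, usps)
    return find_lms(second_derivative(smooth(profile, half_window)), threshold)


if __name__ == "__main__":
    # One USP covering residues 4..6 of an 11-residue query.
    print(detect_lms(11, [(4, 7, 1.0)], half_window=1))  # [(5, 6)]
```

A strongly negative second derivative marks the top of a peak in the smoothed profile, which is why the detected segment sits at the centre of the stacked region and not across its full width.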
LEON has been designed to detect and remove sequences non-homologous to the query that may be present in the alignment. As multiple sequence alignments are now the starting point for structure prediction, functional assignment and phylogenetic studies, it is crucial to validate the alignment in terms of homology coherence. Errors may have been introduced by local matches with high similarity even though the sequences are in fact unrelated along their whole length.

Clustering the sequences in the validated multiple sequence alignment determines the hierarchical relationships between sequences. Subfamilies are created, which the biologist can interpret in the context of his or her studies. Two programs are available for clustering, Secator and DPC. As they are based on two different algorithms, their results may differ, leading to different interpretations.

References:
Lecompte,O., Thompson,J.D., Plewniak,F., Thierry,J. and Poch,O. (2001) Multiple alignment of complete sequences (MACS) in the post-genomic era. Gene, 270, 17–30.
Plewniak,F., Thompson,J.D. and Poch,O. (2000) Ballast: blast post-processing based on locally conserved segments. Bioinformatics, 9, 750–759.
Thompson,J.D., Plewniak,F., Thierry,J. and Poch,O. (2000) DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res., 15, 2919–2926.
Thompson,J.D., Plewniak,F., Thierry,J. and Poch,O. (2003) RASCAL: Rapid scanning and correction of multiple sequence alignment programs. Bioinformatics, 19.
Thompson,J.D., Plewniak,F., Ripp,R., Thierry,J.C. and Poch,O. (2001) Towards a reliable objective function for multiple sequence alignments. J. Mol. Biol., 4, 937–951.
Wicker,N., Perrin,G.R., Thierry,J.C. and Poch,O. (2001) Secator: a program for inferring protein subfamilies from phylogenetic trees. Mol. Biol. Evol., 8, 1435–1441.
Thompson,J.D., Prigent,V. and Poch,O. (2004) LEON: multiple aLignment Evaluation Of Neighbours. Nucleic Acids Res., 32.
Wicker,N., Dembele,D., Raffelsberger,W. and Poch,O. (2002) Density of points clustering, application to transcriptomic data analysis. Nucleic Acids Res., 30.

Main features of PipeAlign:
- start with a single sequence, a group of unaligned sequences or a set of aligned sequences
- each step of PipeAlign can be accessed separately
- the parameters of each step can be changed according to the query characteristics
- an ID is given to each session, allowing the user to check results later
- every alignment can be edited using JalView
- final and intermediate results (Blast, Ballast, alignments, groups) can be saved in several formats
- documentation on each program and help on parameters are available online

Figure 3. Snapshots of the PipeAlign outputs