Master’s course Bioinformatics Data Analysis and Tools Centre for Integrative Bioinformatics FEW/FALW

Slides:

Advertisements

Similar presentations

Functional Site Prediction Selects Correct Protein Models Vijayalakshmi Chelliah Division of Mathematical Biology National Institute.

Advertisements

Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪莊凱翔.

BLAST, PSI-BLAST and position- specific scoring matrices Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.

Measuring the degree of similarity: PAM and blosum Matrix

Master’s course Bioinformatics Data Analysis and Tools Centre for Integrative Bioinformatics FEW/FALW

©CMBI 2005 Exploring Protein Sequences - Part 2 Part 1: Patterns and Motifs Profiles Hydropathy Plots Transmembrane helices Antigenic Prediction Signal.

Prediction to Protein Structure Fall 2005 CSC 487/687 Computing for Bioinformatics.

Structural bioinformatics

Protein Secondary Structures

1-month Practical Course Genome Analysis Protein Structure-Function Relationships Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.

C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Master Course Sequence Alignment Lecture 9 Database searching (3)

Sequence analysis course Lecture 8 Sequence databank searching 1.

Expect value Expect value (E-value) Expected number of hits, of equivalent or better score, found by random chance in a database of the size.

C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Predicting domain features from sequence Bioinformatics Data Analysis and Tools.

C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Master Course Sequence Alignment Lecture 10 Database searching Issues (1)

Identifying functional residues of proteins from sequence info Using MSA (multiple sequence alignment) - search for remote homologs using HMMs or profiles.

Introduction to bioinformatics

Protein Modules An Introduction to Bioinformatics.

Similar Sequence Similar Function Charles Yan Spring 2006.

Identification of Domains using Structural Data Niranjan Nagarajan Department of Computer Science Cornell University.

1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.

SnapDRAGON: protein 3D prediction-based DOMAINATION: based on PSI-BLAST Two methods to predict domain boundary sequence positions from sequence information.

Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.

Prediction of Local Structure in Proteins Using a Library of Sequence-Structure Motifs Christopher Bystroff & David Baker Paper presented by: Tal Blum.

Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.

TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,

Protein Bioinformatics Course

Practical session 2b Introduction to 3D Modelling and threading 9:30am-10:00am 3D modeling and threading 10:00am-10:30am Analysis of mutations in MYH6.

Protein Sequence Alignment and Database Searching.

Sequence Alignment Techniques. In this presentation…… Part 1 – Searching for Sequence Similarity Part 2 – Multiple Sequence Alignment.

Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.

Sequence analysis: Macromolecular motif recognition Sylvia Nagl.

Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices Yan Liu Sep 29, 2003.

Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.

Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha.

Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.

CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.

C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Predicting domain features from sequence Bioinformatics Data Analysis and Tools.

Secondary structure prediction

BLOCKS Multiply aligned ungapped segments corresponding to most highly conserved regions of proteins- represented in profile.

Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.

Protein Structure & Modeling Biology 224 Instructor: Tom Peavy Nov 18 & 23, 2009

HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.

Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.

Multiple Mapping Method with Multiple Templates (M4T): optimizing sequence-to-structure alignments and combining unique information from multiple templates.

Protein Secondary Structure Prediction G P S Raghava.

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.

1 Protein Structure Prediction (Lecture for CS397-CXZ Algorithms in Bioinformatics) April 23, 2004 ChengXiang Zhai Department of Computer Science University.

PROTEIN PATTERN DATABASES. PROTEIN SEQUENCES SUPERFAMILY FAMILY DOMAIN MOTIF SITE RESIDUE.

Domains, their prediction and domain databases Lecture 16: Introduction to Bioinformatics C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I.

Techniques for Protein Sequence Alignment and Database Searching (part2) G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,

Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.

Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.

©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.

Point Specific Alignment Methods PSI – BLAST & PHI – BLAST.

Blast 2.0 Details The Filter Option: –process of hiding regions of (nucleic acid or amino acid) sequence having characteristics.

Step 3: Tools Database Searching

©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.

V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.

V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.

Protein domains, function and associated prediction Lecture 14: Introduction to Bioinformatics C E N T R F O R I N T E G R A T I V E B I O I N F O R M.

Heuristic Methods for Sequence Database Searching BMI/CS 776 Mark Craven February 2002.

The Biologist’s Wishlist A complete and accurate set of all genes and their genomic positions A set of all the transcripts produced by each gene The location.

Sequence database searching – Homology searching Dynamic Programming (DP) too slow for repeated database searches. Therefore fast heuristic methods: FASTA.

Pairwise alignment Now we know how to do it: How do we get a multiple alignment (three or more sequences)? Multiple alignment: much greater combinatorial.

Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.

Sequence similarity, BLAST alignments & multiple sequence alignments

Protein structure prediction.

SnapDRAGON: protein 3D prediction-based

1-month Practical Course Genome Analysis Iterative homology searching

Presentation transcript:

Master’s course Bioinformatics Data Analysis and Tools Centre for Integrative Bioinformatics FEW/FALW

Protein domain delineation based on consistency of multiple ab initio model tertiary structures (SnapDRAGON) Protein domain delineation based on combining homology searching with domain prediction (Domaination) Protein Domain delineation

SnapDRAGON Richard A. George George R.A. and Heringa, J. (2002) J. Mol. Biol., 316, Integrating protein multiple alignment, secondary and tertiary structure prediction to predict structural domains in sequence data

A domain is a: Compact, semi-independent unit (Richardson, 1981). Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973). Recurring functional and evolutionary module (Bork, 1992). “Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).

The DEATH Domain Present in a variety of Eukaryotic proteins involved with cell death. Six helices enclose a tightly packed hydrophobic core. Some DEATH domains form homotypic and heterotypic dimers.

Delineating domains is essential for: Obtaining high resolution structures (x-ray, NMR) Sequence analysis Multiple sequence alignment methods Prediction algorithms (SS, Class, secondary/tertiary structure) Fold recognition and threading Elucidating the evolution, structure and function of a protein family (e.g. ‘Rosetta Stone’ method) Structural/functional genomics Cross genome comparative analysis

Pyruvate kinase Phosphotransferase  barrel regulatory domain  barrel catalytic substrate binding domain  nucleotide binding domain 1 continuous + 2 discontinuous domains Structural domain organisation can be nasty…

Protein structure hierarchical levels VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH PRIMARY STRUCTURE (amino acid sequence) QUATERNARY STRUCTURE SECONDARY STRUCTURE (helices, strands) TERTIARY STRUCTURE (fold)

Protein structure hierarchical levels VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH PRIMARY STRUCTURE (amino acid sequence) QUATERNARY STRUCTURE SECONDARY STRUCTURE (helices, strands) TERTIARY STRUCTURE (fold)

Protein structure hierarchical levels VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH PRIMARY STRUCTURE (amino acid sequence) QUATERNARY STRUCTURE SECONDARY STRUCTURE (helices, strands) TERTIARY STRUCTURE (fold)

Protein structure hierarchical levels VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH PRIMARY STRUCTURE (amino acid sequence) QUATERNARY STRUCTURE SECONDARY STRUCTURE (helices, strands) TERTIARY STRUCTURE (fold)

Distance Regularisation Algorithm for Geometry OptimisatioN (Aszodi & Taylor, 1994) Domain prediction using DRAGON Folds proteins based on the requirement that (conserved) hydrophobic residues cluster together. First constructs a random high dimensional C  distance matrix. Distance geometry is used to find the 3D conformation corresponding to a prescribed target matrix of desired distances between residues.

The DRAGON target matrix is inferred from: A multiple sequence alignment of a protein (old) –Conserved hydrophobicity Secondary structure information (SnapDRAGON) –predicted by PREDATOR (Frishman & Argos, 1996). –strands are entered as distance constraints from the N- terminal C  to the C-terminal C 

The C  distance matrix is divided into smaller clusters. Seperately, each cluster is embedded into a local centroid. The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures. 3 N N N N C  distance matrix Target matrix N CCHHHCCEEE Multiple alignment Predicted secondary structure 100 randomised initial matrices 100 predictions Input data

SnapDragon Generated folds by Dragon Boundary recognition Summed and Smoothed Boundaries CCHHHCCEEE Multiple alignment Predicted secondary structure

Domains in structures assigned using method by Taylor (1997) Domain boundary positions of each model against sequence Summed and Smoothed Boundaries (Biased window protocol) SnapDRAGON

Prediction assessment Test set of 414 multiple alignments;183 single and 231 multiple domain proteins. Sequence searches using PSI-BLAST (Altschul et al., 1997) followed by redundancy filtering using OBSTRUCT (Heringa et al.,1992) and alignment by PRALINE (Heringa, 1999) Boundary predictions are compared to the region of the protein connecting two domains (min  10 residues)

Average prediction results per protein Coverage is the % linkers predicted (TP/TP+FN) Success is the % of correct predictions made (TP/TP+FP)

SnapDRAGON Is very slow (can be hours for proteins>400 aa) – cluster computing implementation Uses consistency in the absence of standard of truth Goes from primary+secondary to tertiary structure to ‘just’ chop protein sequences SnapDRAGON webserver is underway

DOMAINATION Richard A. George Protein domain identification and improved sequence searching using PSI-BLAST (George & Heringa, Prot. Struct. Func. Genet., in press; 2002) Integrating protein sequence database searching and on-the-fly domain recognition

Domaination Current iterative homology search methods do not take into account that: –Domains may have different ‘rates of evolution’. –Common conserved domains, such as the tyrosine kinase domain, can obscure weak but relevant matches to other domain types –Premature convergence (false negatives) –Matrix migration / Profile wander (false positives).

PSI-BLAST Query sequence is first scanned for the presence of so- called low-complexity regions (Wooton and Federhen, 1996), i.e. regions with a biased composition (e.g. TM regions or coiled coils) likely to lead to spurious hits, which are excluded from alignment. Initially operates on a single query sequence by performing a gapped BLAST search Then takes significant local alignments found, constructs a ‘multiple alignment’ and abstracts a position specific scoring matrix (PSSM) from this alignment. Rescans the database in a subsequent round to find more homologous sequences -- Iteration continues until user decides to stop or search converges

PSI-BLAST iteration Q ACD..YACD..Y Pi Px Query sequence PSSM Q Query sequence Gapped BLAST search Database hits Gapped BLAST search ACD..YACD..Y Pi Px PSSM Database hits xxxxxxxxxxxxxxxxx

DOMAINATION Chop and Join Domains

Post-processing low complexity Remove local fragments with > 15% LC

Identifying domain boundaries Sum N- and C-termini of gapped local alignments True N- and C- termini are counted twice (within 10 residues) Boundaries are smoothed using two windows (15 residues long) Combine scores using biased protocol: if Ni x Ci = 0 then Si = Ni+Ci else Si = Ni+Ci +(NixCi)/(Ni+Ci)

Identifying domain deletions Deletions in the query (or insertion in the DB sequences) are identified by –two adjacent segments in the query align to the same DB sequences (>70% overlap), which have a region of >35 residues not aligned to the query. (remove N- and C- termini) DB Query

Identifying domain permutations A domain shuffling event is declared –when two local alignments (>35 residues) within a single DB sequence match two separate segments in the query (>70% overlap), but have a different sequential order. DB Query b a a b

Identifying continuous and discontinuous domains Each segment is assigned an independence score (In). If In>10% the segment is assigned as a continuous domain. An association score is calculated between non-adjacent fragments by assessing the shared sequence hits to the segments. If score > 50% then segments are considered as discontinuous domains and joined.

Create domain profiles A representative set of the database sequence fragments that overlap a putative domain are selected for alignment using OBSTRUCT (Heringa et al. 1992). > 20% and < 60% sequence identity (including the query seq). A multiple sequence alignment is generated using PRALINE (Heringa 1999, 2002; Kleinjung et al., 2002). Each domain multiple alignment is used as a profile in further database searches using PSI-BLAST (Altschul et al 1997). The whole process is iterated until no new domains are identified.

Domain boundary prediction accuracy Set of 452 multidomain proteins 56% of proteins were correctly predicted to have more than one domain 42% of predictions are within  20 residues of a true boundary 49.9% (  44.6%) correct boundary predictions per protein

23.3% of all linkers found in 452 multidomain proteins. Not a surprise since: –Structural domain boundaries will not always coincide with sequence domain boundaries –Proteins must have some domain shuffling For discontinuous proteins 34.2% of linkers were identified 30% of discontinuous domains were successfully joined

Change in domain prediction accuracy using various PSI-BLAST E-value cut-offs

Benchmarking versus PSI-BLAST A set 452 non-homologous multidomain protein structures. Each protein was delineated into its structural domains. Database searches of the individual domains were used as a standard of truth. We then tested to what extent PSI-BLAST and DOMAINATION, when run on the full-length protein sequences, can capture the sequences found by the reference PSI-BLAST searches using the individual domains.

Two sets based on individual domain searches: Reference set 1: consists of database sequences for which PSI-BLAST finds all domains contained in the corresponding full length query. Reference set 2: consists of database sequences found by searching with one or more of the domain sequences Therefore set 2 contains many more sequences than set 1 Ref set 1 Ref set 2 Query DB seqs

Sequences found over Reference sets 1 and 2

Reference 1 PSI-BLAST finds 97.9% of sequences Domaination finds 99.1% of sequences Reference 2 PSI-BLAST finds 83.2% of sequences Domaination finds 90.6% of sequences

Test against SMART sequence domains A set of 15 sequences with domain definition in the SMART database (Ponting et al. 1999) Create two reference sets based on individual domain searches.

Sequences found over Reference sets 1 and 2 from 15 Smart sequences

SSEARCH significance test Verify the statistical significance of database sequences found by relating them to the original query sequence. SSEARCH (Pearson & Lipman 1988). Calculates an E-value for each generated local alignment. This filter will lose distant homologies. Use the 452 proteins with known structure.

Significant sequences found in database searches At an E-value cut-off of 0.1 the performance of DOMAINATION searches with the full-length proteins is 15% better than PSI-BLAST

Scooby-domain: prediction of globular domains in protein sequence Richard A. George 1,2, Kuang Lin 3 and *Jaap Heringa 4 1 Inpharmatica Ltd, 60 Charlotte Street, London W1T 2NU UK 2 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK 3 Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill NW7 1AA, UK 4 Centre for Integrative Bioinformatics (IBIVU), Faculty of Sciences and Faculty of Earth and Life Sciences, Vrije Universiteit, De Boelelaan 1081a, 1081HV Amsterdam, The Netherlands * Corresponding author

Generating a domain probability matrix for a query sequence Scooby-domain uses a multilevel smoothing window to predict the location of domains in a query sequence. A window size, representing the length of a putative domain, is incremented starting from the smallest domain size observed in the database to the largest domain size. Based on the window length and its average hydrophobicity, the probability that it can fold into a domain is found directly from the distribution of domain size and hydrophobicity, calculated using S- level domain representatives from the CATH domain database (12). For each domain, percentage hydrophobicity is calculated using a binary hydrophobicity scale, where 11 amino acid types are considered as hydrophobic: Ala, Cys, Phe, Gly, Ile, Leu, Met, Pro, Val, Trp and Tyr (6). Visualisation of the Scooby-domain probability matrix for a sequence can be used to effectively identify regions that are likely to fold into domains or are likely to be unstructured.

Automatic domain boundary assignment The Scooby-domain web server performs fast, automatic, domain annotation by identifying the most domain-like regions in the query sequence. The highest probability in the domain probability matrix represents the first predicted domain. The corresponding stretch of sequence for this domain is removed from the sequence. Therefore, the first predicted domain will always have a continuous sequence and further domain predictions can encompass discontinuous domains. If the excised domain is at a central position in the sequence, the resulting N- and C-termini fragments are rejoined and the probability matrix recalculated as before. The second highest probability is then found and the corresponding sub-sequence removed.

Table 1. No weightingN- and C-termini weighting MethodSensitivityAccuracySensitivityAccuracy FirstScoobyDo Domainati on Linker Class BestScoobyDo Domainati on Linker Class Two measures are used to score predictions: percentage of real boundaries predicted (sensitivity) and percentage of correct predictions made (accuracy). ‘N- and C-termini weighting’ are predictions made with increased probability of domain boundaries at the ends of the protein sequences. ‘Domaination’ are results for ScoobyDo predictions made with added information from Domaination. ‘Linker’ are results for ScoobyDo predictions made with added information from the interdomain linker propensities from the Linker database. ‘Class’ are ScoobyDo predictions made using three smoothing windows to separately predict all-α, all-β and α-β domains. ‘First’ is the highest probability prediction made. ‘Best’ is the best of ten predictions made.

Figure 1 (a) Histogram of CATH domains as a function of their hydrophobicity and domain length. The colour bar to the right of the figure shows the scale of the distribution (0 to 1): red areas represent regions that have a high frequency of domain occurrence. The second plot shows the average CATH domain hydrophobicity minus the average hydrophobicity for randomised sequences (generated from a random selection of residues from sequences in the CATH database). (b) Multilevel smoothing window. The horizontal axis corresponds to the sequence position, i, and the vertical axis represents the window length used in the smoothing of sequence hydrophobicity, j. Each position in the matrix corresponds to the average hydrophobicity assigned to the centre of a window during smoothing. (c) Each position in the matrix is converted to a probability that it will fold into a domain, based on the lengths and hydrophobicities observed in the distribution of CATH domains. (d) i. The highest scoring window (first predicted domain) is identified in the probability matrix and the sequence region it encapsulates (blue triangle) is removed from the sequence. ii. The resulting sequence fragments are rejoined and the probability matrix recalculated. iii. The smoothing windows that encapsulate the last 15 residues of the N-terminal fragment and the first 15 residues of the C-terminal fragment have their probabilities set to zero (white bands). If the next highest scoring region is found in the red region then the excised domain will be discontinuous, otherwise it will be continuous.