1 Improve Protein Disorder Prediction Using Homology Instructor: Dr. Slobodan Vucetic Student: Kang Peng.

Presentation transcript:

1 Improve Protein Disorder Prediction Using Homology Instructor: Dr. Slobodan Vucetic Student: Kang Peng

2 Protein Disorder Prediction
What is a protein? A protein is usually a chain built from 20 different amino acids (AAs), so it can be represented as a string over a 20-character alphabet. A protein usually has a 3D structure, which is important to its function.
What is a disordered protein? A disordered protein is one in which part or all of the chain has no identified 3D structure.
Can protein disorder be predicted? The predictor developed by Dr. Vucetic predicts protein disorder with an accuracy of 82.6%.
Current dataset used to train the disorder predictor: 145 proteins with confirmed long disordered regions and 130 proteins that are totally ordered.
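As a small illustration of the "string of 20 characters" representation, the sketch below checks that a sequence uses only the 20 standard one-letter codes and computes its amino acid composition. The function names and the toy sequence are hypothetical; the slides do not say which input features the predictor actually uses.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def is_valid_protein(sequence):
    """Return True if the sequence is non-empty and uses only the 20 standard letters."""
    return bool(sequence) and all(residue in AMINO_ACIDS for residue in sequence)


def composition(sequence):
    """Return the fraction of each of the 20 amino acids in the sequence."""
    counts = Counter(sequence)
    return {aa: counts[aa] / len(sequence) for aa in AMINO_ACIDS}


example = "MATWL"  # hypothetical toy sequence, not taken from the dataset
print(is_valid_protein(example))
print(composition(example)["M"])
```

Composition is only one possible way to turn the string into numbers; it is shown here because it needs nothing beyond the sequence itself.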

3 The Objective
What are homologous/similar sequences? Proteins that may derive from the same ancestor. They tend to have similar amino acid sequences.
Where to find homologous/similar sequences? For a given protein (its amino acid sequence), homologous/similar sequences can be found using the NCBI BLAST Web server.
The hypothesis: Homologous/similar sequences may have similar structures, and therefore similar disordered regions, so similar sequences can be used to enhance the training set.
Objective: Improve disorder prediction using homologous/similar sequences.

4 Methodology
To enhance the training set using homologous sequences:
- Find homologous sequences that have segments similar to the disordered proteins in the original dataset
- Remove sequences that are too similar to the original sequences
- Label these segments as disordered
- Train disorder predictors with the new data
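The labeling step can be sketched as follows: given a homologous sequence and the positions of the segments that matched a known disordered region, produce one label per residue. This is a minimal sketch under assumed conventions (0-based, end-exclusive coordinates; label 1 for disorder); the function name and the example coordinates are hypothetical.

```python
def label_residues(sequence_length, disorder_segments):
    """Return per-residue labels: 1 inside any disorder segment, 0 elsewhere.

    Segments are (start, end) pairs, 0-based with end exclusive (an assumed convention).
    """
    labels = [0] * sequence_length
    for start, end in disorder_segments:
        if not 0 <= start < end <= sequence_length:
            raise ValueError(f"segment ({start}, {end}) is outside the sequence")
        for position in range(start, end):
            labels[position] = 1
    return labels


# Hypothetical homolog of length 10 whose residues 2..5 aligned to a disordered segment.
print(label_residues(10, [(2, 6)]))
```

Residues outside the matched segment are labeled 0 here only to keep the example simple; the slides do not say how unmatched residues of a homolog are treated.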

5 Get Homologous Sequences
Each disordered segment in the original dataset is sent to the NCBI BLAST Web server; this is done automatically by a Visual Basic program.
Search against the non-redundant database (nr), return sequences with E-value < sequences found
Discard sequences that are too similar to the original sequences.
In total, 444 sequences are left, corresponding to 55 original disordered sequences.
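The two filters described on this slide (keep hits below an E-value cutoff, then discard near-duplicates of the query) might look like the sketch below. The `Hit` record, the function names, and both threshold values are hypothetical: the transcript does not preserve the E-value cutoff and does not state how "too similar" was measured, so percent identity over aligned columns is used here as an assumed stand-in.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one BLAST hit, holding the gapped alignment strings."""
    subject_id: str
    e_value: float
    aligned_query: str
    aligned_subject: str


def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def filter_hits(hits, max_e_value, max_identity):
    """Keep significant hits that are not near-duplicates of the query."""
    return [
        hit for hit in hits
        if hit.e_value < max_e_value
        and percent_identity(hit.aligned_query, hit.aligned_subject) <= max_identity
    ]


hits = [
    Hit("near_duplicate", 1e-30, "MATWLKE", "MATWLKE"),
    Hit("homolog", 1e-8, "MATWLKE", "MSTWIKE"),
    Hit("weak", 5.0, "MATWLKE", "GGTPLQQ"),
]
# Both cutoffs below are illustrative assumptions, not the values used in the study.
kept = filter_hits(hits, max_e_value=1e-3, max_identity=90.0)
print([hit.subject_id for hit in kept])
```

Removing near-duplicates matters because a homolog that is almost identical to a training sequence adds little new information and can leak into the test set.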

6 Which BLAST to Use?
Standard BLAST: We may need a scoring matrix specially developed for disordered protein alignments.
PSI-BLAST: It is adaptive and builds its scoring matrix from the results of the previous iteration, so the choice of initial scoring matrix is not very important.
Current experiment: PSI-BLAST with initial matrix BLOSUM62, using the result of the 1st iteration.
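To make "builds a scoring matrix from the results of the previous iteration" concrete, the sketch below derives position-specific log-odds scores from a handful of aligned hits. It is a deliberately simplified illustration, not PSI-BLAST's actual procedure: it assumes ungapped, equal-length sequences, a uniform background frequency, and a fixed pseudocount, and the function names and toy sequences are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_sequences, pseudocount=1.0):
    """Return one {amino acid: log2-odds score} dict per alignment column.

    Assumes ungapped sequences of equal length and a uniform background of 1/20.
    """
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in range(length):
        # Start every count at the pseudocount so unseen residues get a finite score.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned_sequences:
            counts[seq[column]] += 1.0
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of a segment aligned to the profile."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


# Hypothetical hits from a first search iteration.
pssm = build_pssm(["MATWL", "MASWL", "MATWI"])
print(score(pssm, "MATWL") > score(pssm, "GGGGG"))
```

Because the scores come from the hits themselves, residues that recur at a position are rewarded regardless of what the initial matrix (BLOSUM62 in this experiment) would have assigned.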

7 Train Disorder Predictor
Group sequences into families: Newly found sequences are grouped according to the original sequences they are similar to, giving 145 families in total (only 55 families contain new sequences).
Neural network + bagging: Randomly sample a balanced training set and train a neural network (NN) on it. Repeat 10 times and combine the 10 NNs by majority voting.
Cross-validation: Randomly divide sequences into groups; use 1 group as the testing set and randomly sample the training set from the remaining groups.
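The training scheme on this slide (balanced sampling, 10 bagged models, majority voting, and a group-wise split) can be sketched as below. Several things are assumptions made for illustration: a nearest-class-mean classifier on a single feature stands in for the neural network, whole families are assigned to the same fold, and the fold count, function names, and toy data are hypothetical.

```python
import random
from collections import Counter


def balanced_sample(examples, rng):
    """Draw equally many disorder (1) and order (0) examples, with replacement."""
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    n = min(len(positives), len(negatives))
    return ([rng.choice(positives) for _ in range(n)]
            + [rng.choice(negatives) for _ in range(n)])


def train_centroid_classifier(sample):
    """Stand-in base learner (NOT a neural network): predict the nearest class mean."""
    mean = {}
    for label in (0, 1):
        values = [x for x, y in sample if y == label]
        mean[label] = sum(values) / len(values)
    return lambda x: min((0, 1), key=lambda label: abs(x - mean[label]))


def train_bagged(examples, n_models, rng):
    """Train one base learner per balanced random sample."""
    return [train_centroid_classifier(balanced_sample(examples, rng))
            for _ in range(n_models)]


def bagged_predict(models, x):
    """Combine the models by majority vote (a tie goes to the label seen first)."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


def family_folds(family_ids, n_folds, rng):
    """Assign each family to one fold so its sequences are never split across folds."""
    families = sorted(set(family_ids))
    rng.shuffle(families)
    return {family: index % n_folds for index, family in enumerate(families)}


rng = random.Random(0)
# Hypothetical (feature, label) pairs: label 1 = disorder, 0 = order.
examples = [(0.1, 0), (0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1)]
models = train_bagged(examples, n_models=10, rng=rng)
print(bagged_predict(models, 0.85), bagged_predict(models, 0.15))
print(family_folds(["f1", "f1", "f2", "f3", "f4", "f5"], n_folds=3, rng=rng))
```

Keeping a family inside one fold is the reason for grouping sequences first: otherwise a homolog in the training set could make the test accuracy look better than it is.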

8 Results
The classification accuracies:
(a) Without Homologous Sequences
(b) With Homologous Sequences
Each table has the columns Experiment, Disorder, Order, and All, with summary rows Avg and Std.

9 Conclusion
After adding homologous sequences to the training set, disorder prediction accuracy increases by 2%.

10 Thank You!