1 Computational Analysis of Protein-DNA Interactions Changhui (Charles) Yan Department of Computer Science Utah State University.

Presentation transcript:

1 Computational Analysis of Protein-DNA Interactions Changhui (Charles) Yan Department of Computer Science Utah State University

2 Problem I: Identifying amino acid residues involved in protein-DNA interactions from sequence

3 Materials And Methods 56 double-stranded DNA binding proteins previously used in the study of Jones et al. (2003) Encoding

4 Materials And Methods

5 Naïve Bayes Classifier. Leave-one-out cross-validation

6 Naïve Bayes Classifier. Leave-one-out cross-validation
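The classifier and the leave-one-out protocol named on these two slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: the window half-width `w`, the `-` padding symbol, the Laplace smoothing, the `0`/`1` label strings, and the three toy proteins are all assumptions introduced here.

```python
import math
from collections import Counter, defaultdict

def windows(seq, labels, w=4):
    """Yield (window, label) per residue; '-' pads the termini (assumed encoding)."""
    padded = "-" * w + seq + "-" * w
    for i, lab in enumerate(labels):
        yield padded[i:i + 2 * w + 1], lab

def train(examples):
    """Count class frequencies and per-position residue frequencies."""
    class_counts = Counter()
    feat_counts = defaultdict(Counter)  # (class, window position) -> residue counts
    for win, lab in examples:
        class_counts[lab] += 1
        for pos, aa in enumerate(win):
            feat_counts[(lab, pos)][aa] += 1
    return class_counts, feat_counts

def predict(model, win, alphabet_size=21):
    """Return the class with the highest Laplace-smoothed log posterior."""
    class_counts, feat_counts = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for lab, n in class_counts.items():
        score = math.log(n / total)
        for pos, aa in enumerate(win):
            score += math.log((feat_counts[(lab, pos)][aa] + 1) / (n + alphabet_size))
        if score > best_score:
            best, best_score = lab, score
    return best

def leave_one_out(proteins, w=4):
    """Hold out each protein in turn, train on the rest, predict its residues."""
    results = {}
    for name, (seq, labels) in proteins.items():
        training = [ex for other, (s, l) in proteins.items() if other != name
                    for ex in windows(s, l, w)]
        model = train(training)
        results[name] = "".join(predict(model, win) for win, _ in windows(seq, labels, w))
    return results

# Hypothetical toy data: '1' marks a DNA-binding residue, '0' a non-binding one.
proteins = {
    "protA": ("MKRRKSTLL", "011110000"),
    "protB": ("GGKRKRAVL", "001111000"),
    "protC": ("ALLKRRGSA", "000111000"),
}
results = leave_one_out(proteins)
```

Holding out whole proteins, rather than individual residues, keeps overlapping windows from the same chain out of the training set for that chain.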

7 Leave-One-Out Cross-Validations. Columns: sequence-based (identities (ID); ID + entropy) and sequence/structure-based (ID + rASA; ID + rASA + entropy). Rows: correlation coefficient, accuracy (%), specificity+ (%), sensitivity+ (%).

8 Predictions in the Context of 3-D Structures. Pit-1, PDB 1au7. TP: 30, FP: 16, TN: 86, FN: 14. CC: 0.51 (2nd). Accuracy: 79%. Panels: predicted, actual.
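The performance measures quoted on this slide can be recomputed from the four confusion-matrix counts. The sketch below assumes that specificity+ means TP / (TP + FP) and sensitivity+ means TP / (TP + FN), and that CC is the Matthews correlation coefficient; the slides do not define these terms, and the function name `evaluate` is introduced here for illustration.

```python
import math

def evaluate(tp, fp, tn, fn):
    """Compute accuracy, specificity+, sensitivity+ and correlation coefficient."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    specificity_plus = tp / (tp + fp)   # assumed definition: fraction of predictions that are correct
    sensitivity_plus = tp / (tp + fn)   # assumed definition: fraction of binding residues recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "specificity_plus": specificity_plus,
        "sensitivity_plus": sensitivity_plus,
        "cc": cc,
    }

# Counts taken from the Pit-1 (PDB 1au7) slide.
pit1 = evaluate(tp=30, fp=16, tn=86, fn=14)
```

For the Pit-1 counts this gives an accuracy of about 79% and a correlation coefficient of about 0.52, close to the values on the slide.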

9 λ-Cro, PDB 6cro. TP: 10, FP: 5, TN: 34, FN: 10. CC: 0.37 (19th). Accuracy: 73%. Panels: predicted, actual.

10 Predictions Compared With PROSITE Motifs. Predicted binding sites substantially overlap with 34 of the 37 “DNA-binding” PROSITE motifs. In 52 of the 56 proteins, the predictor identifies at least 20% of the DNA-binding residues. 28 of the 56 proteins contain no PROSITE motifs that are annotated as “DNA-binding”.

11 Comparison With Previous Study. Method: Naïve Bayes classifier vs. Ahmad and Sarai method*. Correlation coefficient (values missing). Accuracy (%): 80 vs. 66. Specificity+ (%): 29 vs. 21. Sensitivity+ (%): 48 vs. 68. * Ahmad, S. and Sarai, A. (2005) PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics, 6, 33.

12 Summary. A simple sequence-based Naïve Bayes classifier predicts interface residues in DNA-binding proteins with 75% accuracy, 37% specificity+, 53% sensitivity+ and a correlation coefficient of 0.29. Predicted binding sites correctly indicate the locations of actual binding sites and substantially overlap with known PROSITE motifs.

13 Problem II: Identification of Helix-Turn-Helix (HTH) DNA-binding motifs

14 HTH Motifs. Sequences sharing low similarity can fold into a similar HTH structure. Identifying HTH motifs from sequence is extremely challenging.

15 Trick 1: Including more information. Amino acid sequence. Secondary structure.

16 Hidden Markov Model (HMM) LQQITHIANQL-GLE----KDVVRVWF

17 Hidden Markov Model (HMM_AA_SS) LQQITHIANQL-GLE----KDVVRVWF HHHEEHEEEHMHE----HHEEMMEH
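One way to let a model see both tracks at once is to emit a joint (amino acid, secondary structure) symbol at each aligned position. The slides do not say how HMM_AA_SS combines the two, so the sketch below is an assumption throughout: it builds paired symbols and estimates per-column emission probabilities with pseudocounts. It is not a full profile HMM (no transitions, insert states, or delete states), and the toy alignment, the three-letter structure alphabet, and the function names are hypothetical.

```python
from collections import Counter

GAP = "-"

def pair_symbols(aa_aln, ss_aln):
    """Combine aligned amino-acid and secondary-structure strings column by column."""
    if len(aa_aln) != len(ss_aln):
        raise ValueError("aligned strings must have equal length")
    paired = []
    for aa, ss in zip(aa_aln, ss_aln):
        # A gap in either track makes the whole column position a gap.
        paired.append(GAP if GAP in (aa, ss) else aa + ss)
    return paired

def column_emissions(paired_rows, pseudocount=1.0, n_symbols=20 * 3):
    """Estimate emission probabilities of joint symbols for each alignment column.

    n_symbols assumes 20 amino acids times 3 secondary-structure states.
    Only symbols observed in a column are stored; unseen symbols would
    receive pseudocount / total.
    """
    emissions = []
    for column in zip(*paired_rows):
        counts = Counter(sym for sym in column if sym != GAP)
        total = sum(counts.values()) + pseudocount * n_symbols
        emissions.append({sym: (c + pseudocount) / total for sym, c in counts.items()})
    return emissions

# Hypothetical toy alignment: H = helix, C = coil.
rows = [
    pair_symbols("LQQIT-HIA", "HHHHC-HHH"),
    pair_symbols("LKEIA-KLT", "HHHHC-HHH"),
]
emissions = column_emissions(rows)
```

The joint alphabet is three times larger than the amino-acid alphabet, so each column has fewer observations per symbol.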

18 Trick 2: There are similarities among the 20 naturally occurring amino acids. Reduced alphabets.

19 Reduced Alphabets. Schemes for reducing the amino acid alphabet based on the BLOSUM50 matrix of Henikoff and Henikoff (1992), derived by grouping and averaging the similarity matrix elements as described in the text (Murphy et al. 2000).
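Applying a reduced alphabet amounts to replacing each residue with a representative of its group. The sketch below uses groupings commonly attributed to Murphy et al. (2000); they are reproduced from memory as an assumption and should be checked against the paper before use. The names `MURPHY_GROUPS`, `build_table`, and `reduce_sequence` are introduced here for illustration.

```python
# Assumed groupings; the first letter of each group is used as its representative.
MURPHY_GROUPS = {
    "Murphy_15": ["LVIM", "C", "A", "G", "S", "T", "P", "FY", "W",
                  "E", "D", "N", "Q", "KR", "H"],
    "Murphy_10": ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"],
    "Murphy_8": ["LVIMC", "AG", "ST", "P", "FYW", "EDNQ", "KR", "H"],
}

def build_table(groups):
    """Map every amino acid to the representative letter of its group."""
    return {aa: group[0] for group in groups for aa in group}

def reduce_sequence(seq, scheme):
    """Rewrite a sequence in the chosen reduced alphabet; gaps and unknowns pass through."""
    table = build_table(MURPHY_GROUPS[scheme])
    return "".join(table.get(aa, aa) for aa in seq)

# Start of the example sequence from the HMM slides, reduced to 10 letters.
reduced = reduce_sequence("LQQITHIANQL-GLE", "Murphy_10")
```

A smaller alphabet pools counts across similar residues, which can help when a family has few training sequences.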

20 Cross-Family Evaluations. Columns: true positive (1), false positive (2). Rows: HMM_AA30; HMM_AA_SS (20 letters) (3); HMM_AA_SS (Murphy_15); HMM_AA_SS (Murphy_10); HMM_AA_SS (Murphy_8). Notes: 1. True positive: HTH motifs that are correctly identified as such. 2. False positive: non-HTH motifs that are identified as HTH motifs. 3. The alphabet used to encode amino acid sequences.

21 Questions

22 Within-Family Three-Fold Cross-Validations. Columns: family (number of HTH motifs in the family), HMM_AA, HMM_AA_SS (Murphy_15). PF00126 (1635); PF00165 (90): 63, 80; PF00196 (30): 26, 30; PF04545 (164); PF01022 (42): 39; PF00046 (189); PF03965 (48): 48.

23 Comparisons of HMM_AA_SS with FFAS03 in Cross-Family Evaluations. Rows: total HTH motifs; recognized by both FFAS03 and HMM_AA_SS; recognized by FFAS03 only; recognized by HMM_AA_SS only.

24 Putative HTH motifs in Ureaplasma parvum. Columns: protein, location, annotation from UniProt. sp|Q9PQE5|SCPB_UREPA: participates in chromosomal partition during cell division. sp|Q9PQV6|RPOB_UREPA: DNA-directed RNA polymerase. sp|Q9PR27|SYY_UREPA: tyrosyl-tRNA synthetase. sp|Q9PQC2|SYA_UREPA: alanyl-tRNA synthetase. sp|Q9PQ74|DPO3A_UREPA: DNA polymerase III subunit alpha. sp|Q9PQX7|Y166_UREPA: hypothetical protein.