iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance-Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition


iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance-Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition
Name: WangShanyi
ID: 14S051067

Abstract
DNA-binding proteins play crucial roles in various cellular processes.
Method: iDNA-Prot|dis incorporates the amino acid distance-pair coupling information and the amino acid reduced alphabet profile into the general pseudo amino acid composition.
Verification: jackknife and independent dataset tests.
Web-Server: Prot_dis/

Introduction  DNA-binding proteins are essential components of both eukaryotic and prokaryotic cells. They interact with DNA and play crucial roles in various cellular processes.

Introduction  Computational methods:
1. Structure-based methods. Major drawback: they cannot be applied to the huge number of proteins whose structures are unavailable.
2. Sequence-based methods.

Introduction  Procedure:
(1) Construct or select a valid benchmark dataset to train and test the predictor;
(2) Formulate the samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted;
(3) Introduce or develop a powerful algorithm (or engine) to operate the prediction;
(4) Properly perform cross-validation tests to objectively evaluate its anticipated accuracy;
(5) Establish a user-friendly web-server for the predictor that is accessible to the public.

Materials and Method 

Materials and Method  PseAAC of Distance-Pairs and Reduced Alphabet Scheme
Key question: how can a biological sequence be formulated as a discrete model or a vector while still keeping considerable sequence-order information?
1. The number of biological sequences with different sequence orders is extremely high, and their lengths vary widely.
2. SVM, neural networks, random forests, etc. can only handle fixed-length vectors, not sequences of different lengths.
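The vector-formulation problem above can be illustrated with the simplest discrete model, the plain amino acid composition. This is only a sketch: the names `AMINO_ACIDS` and `composition_vector` are hypothetical and not taken from the slides. It shows that sequences of very different lengths map to vectors of the same length, but it discards all sequence-order information, which is the gap the distance-pair approach is meant to address.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition_vector(sequence):
    """Return the 20 residue frequencies of a protein sequence."""
    sequence = sequence.upper()
    # Count only standard residues; anything else (e.g. 'X') is ignored.
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]

# Toy sequences (made up for illustration, not real proteins).
short = composition_vector("MKRRA")
long_ = composition_vector("MKRRAGGSLLKKRRDE" * 10)
print(len(short), len(long_))  # both vectors have 20 entries
```

Any classifier that expects fixed-length input can consume such vectors directly, whatever the length of the original sequence.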

Materials and Method 

Materials and Method  Reduced Amino Acid Alphabet Scheme
The high-dimension disaster causes three problems:
(1) it unnecessarily increases the computational time;
(2) it causes misrepresentation due to information redundancy or noise, which leads to poor prediction accuracy;
(3) it causes overfitting, which leaves the predictor with a serious bias and an extremely low capacity for generalization.

Materials and Method  Reduced Amino Acid Alphabet Scheme

Materials and Method  Reduced Amino Acid Alphabet Scheme

Materials and Method  Reduced Amino Acid Alphabet Scheme
Advantages: it cuts down the dimension of the PseAAC vector and improves the predictive performance.
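The dimension reduction can be sketched as follows. The six-cluster grouping below is a made-up example for illustration only, not one of the reduced alphabets evaluated in the slides, and `EXAMPLE_CLUSTERS`, `reduce_sequence`, and `pair_feature_count` are hypothetical names. The sketch also assumes that each distance 1..d contributes its own block of ordered-pair features, so an alphabet of n letters yields n² pairs per distance.

```python
# Hypothetical 6-cluster grouping of the 20 residues (illustration only).
EXAMPLE_CLUSTERS = {
    "a": "AVLIMC",  # aliphatic / sulfur-containing
    "r": "FWYH",    # aromatic
    "p": "STNQ",    # polar uncharged
    "k": "KR",      # positively charged
    "d": "DE",      # negatively charged
    "g": "GP",      # special backbone
}
RESIDUE_TO_CLUSTER = {
    aa: label for label, members in EXAMPLE_CLUSTERS.items() for aa in members
}

def reduce_sequence(sequence):
    """Rewrite a sequence in the reduced alphabet; unknown residues are dropped."""
    return "".join(
        RESIDUE_TO_CLUSTER[aa] for aa in sequence.upper() if aa in RESIDUE_TO_CLUSTER
    )

def pair_feature_count(alphabet_size, d):
    """Ordered-pair features over distances 1..d (assumes one block per distance)."""
    return alphabet_size ** 2 * d

print(reduce_sequence("MKRRAGD"))
print(pair_feature_count(20, 3), pair_feature_count(len(EXAMPLE_CLUSTERS), 3))
```

Under these assumptions, moving from 20 letters to 6 shrinks the pair features for d=3 from 1200 to 108.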

Materials and Method  Support Vector Machine SVM is based on the structural risk minimization principle from statistical learning theory.

Materials and Method  Evaluation Method of Performance
How to properly examine the prediction quality is key to developing a new predictor and estimating its potential application value.
Cross-validation methods: independent dataset test, subsampling or K-fold (such as 5-fold, 7-fold, or 10-fold) test, and jackknife test.
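The jackknife test is leave-one-out cross-validation: each sample is held out once and predicted by a model trained on all the others. A minimal sketch follows. To stay dependency-free it uses a nearest-centroid rule as a stand-in classifier, not the SVM used by iDNA-Prot|dis, and all function names and the toy data are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, query):
    """Stand-in classifier (not an SVM): pick the class with the closest mean vector."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((q - c) ** 2 for q, c in zip(query, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def jackknife_accuracy(features, labels, predict=nearest_centroid_predict):
    """Hold each sample out once, train on the rest, test on the held-out sample."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)

# Toy, well-separated data (made up for illustration).
toy_x = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 0.9], [0.9, 1.0], [1.1, 1.0]]
toy_y = [0, 0, 0, 1, 1, 1]
print(jackknife_accuracy(toy_x, toy_y))
```

Because every sample is tested exactly once and the split involves no randomness, the jackknife gives the same result on every run for a given dataset, at the cost of training the model once per sample.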

Materials and Method  Evaluation Method of Performance
The literature commonly uses a set of four metrics: sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC).
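The four metrics can be computed from the confusion-matrix counts. The sketch below uses the standard textbook definitions in terms of true/false positives and negatives. The slides do not show the exact formulation used, so treat this as an assumption; the function name and the example counts are hypothetical.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Return Sn, Sp, Acc, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity: recall on positives
    sp = tn / (tn + fp)                    # specificity: recall on negatives
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the confusion matrix is empty;
    # returning 0.0 in that case is a common convention, used here as an assumption.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Made-up counts for illustration.
print(binary_metrics(tp=40, tn=45, fp=5, fn=10))
```

MCC is useful alongside Acc because it stays informative when the positive and negative classes are imbalanced.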

Results and Discussion  Impact of the Pairwise Distance on iDNA-Prot|dis
Question: which value of the pairwise distance d should be used?

Results and Discussion  Discriminant Visualization and Interpretation

Results and Discussion  Discriminant Visualization and Interpretation
The discriminative power of all 400 distance amino acid pairs in iDNA-Prot|dis is depicted in Fig. 3A. The three most discriminative amino acid pairs are R-R, K-R, and R-K, which correspond to the three darkest spots.

Results and Discussion  Discriminant Visualization and Interpretation
A specific DNA-binding protein, 1HLV chain A, was selected to investigate whether the most discriminative distance amino acid pair, R-R, reflects the characteristics of this DNA-binding protein. For d=3, the R-R distance pairs include RR, R*R, and R**R, with distances 1, 2, and 3, respectively.
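Counting distance pairs such as RR, R*R, and R**R can be sketched as below, where `*` stands for any residue. The function names and the toy sequence are hypothetical, and the normalization (dividing by the number of pairs available at each distance) is an assumption, since the slides do not state how the counts are scaled.

```python
from collections import Counter

def distance_pair_counts(sequence, d):
    """Count ordered residue pairs (a, b) separated by k = 1..d positions."""
    counts = Counter()
    for k in range(1, d + 1):
        for i in range(len(sequence) - k):
            counts[(k, sequence[i], sequence[i + k])] += 1
    return counts

def distance_pair_frequencies(sequence, d):
    """Scale each count by the number of pairs at that distance (an assumption)."""
    counts = distance_pair_counts(sequence, d)
    return {key: n / (len(sequence) - key[0]) for key, n in counts.items()}

toy = "RARR"  # made-up sequence, not 1HLV chain A
counts = distance_pair_counts(toy, 3)
for k in (1, 2, 3):
    # k=1 matches RR, k=2 matches R*R, k=3 matches R**R
    print(k, counts[(k, "R", "R")])
```

In the toy sequence each of the three R-R patterns occurs exactly once: RR at positions 3-4, R*R at positions 1-3, and R**R at positions 1-4.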

Results and Discussion  Discriminant Visualization and Interpretation

Results and Discussion  Reduced Amino Acid Alphabet Scheme
A reduced alphabet is any clustering of amino acids based on some measure of their relative similarity.
Question: which reduced alphabet is best?

Results and Discussion  Comparison with Other Related Methods

Results and Discussion  Comparison with Other Related Methods
To provide a graphic illustration of the performances of the four predictors, the corresponding ROC (receiver operating characteristic) curves were plotted.
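An ROC curve is obtained by sweeping a decision threshold over the predictor's scores and recording the false positive rate and true positive rate at each step. The sketch below is a generic illustration with made-up scores. It is not the plotting procedure behind the slides' figure, and `roc_points` and `auc` are hypothetical names.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, lowering the threshold one distinct score at a time."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Samples sharing a score cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up scores: label 1 = DNA-binding, label 0 = non-binding.
points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(points, auc(points))
```

A curve that hugs the top-left corner, and so has an area closer to 1, indicates a predictor that ranks positives above negatives more consistently.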

Results and Discussion  Independent Test
We also extended the comparison with other methods via an independent dataset test.

Results and Discussion  Web-Server Guide
Step 1. Open the web-server by clicking the link at bioinformatics.hitsz.edu.cn/iDNA-Prot_dis/ and you will see its top page. Click on the Read Me button to see a brief introduction to the server.
Step 2. Check the open circle to select which alphabet profile to use for the prediction.
Step 3. Either type or copy and paste the query protein sequence into the input box at the center, or upload your input data with the Browse button. The input sequence should be in FASTA format.
Step 4. Click on the Submit button to see the predicted result.

Conclusions It is anticipated that the iDNA-Prot|dis predictor will become a high throughput tool for both basic research and drug development.

THANKS