Tools to analyze protein characteristics Protein sequence -Family member -Multiple alignments Identification of conserved regions Evolutionary relationship.

1 Tools to analyze protein characteristics
- Protein sequence
  - Family member
  - Multiple alignments
- Identification of conserved regions
- Evolutionary relationship (phylogeny)
- 3-D fold model
- Protein sorting and sub-cellular localization
- Anchoring into the membrane
- Signal sequence (tags)
  - Some nascent proteins contain a specific signal, or targeting sequence, that directs them to the correct organelle (ER, mitochondria, chloroplast, lysosome, vacuole, Golgi, or cytosol).

2 Questions
- Can we train computers:
  - To detect signal sequences and predict protein destination?
  - To identify conserved domains (or a pattern) in proteins?
  - To predict the membrane-anchoring type of a protein (transmembrane domain, GPI anchor…)?
  - To predict the 3-D structure of a protein?
- Learning algorithms are good for solving problems in pattern recognition because they can be trained on a sample data set.
- Classes of learning algorithms:
  - Artificial neural networks (ANNs)
  - Hidden Markov models (HMMs)

3 Artificial neural networks (ANN)
- Machine learning algorithms that mimic the brain. Real brains, however, are orders of magnitude more complex than any ANN.
- ANNs, like people, learn by example. They are trained rather than explicitly programmed to perform a specific task.
- An ANN is composed of a large number of highly interconnected processing elements (neurons) working simultaneously to solve specific problems.
- The first artificial neuron was developed in 1943 by the neurophysiologist Warren McCulloch and the logician Walter Pitts.
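
To make "learning by example" concrete, here is a minimal sketch of a single threshold neuron trained with the classic perceptron rule. The training data (the logical AND function), the function names, and the integer learning rate are assumptions chosen for illustration; they do not come from the slides.

```python
def step(total):
    """Threshold activation: the neuron fires (1) only for a positive input sum."""
    return 1 if total > 0 else 0


def predict(weights, bias, inputs):
    """Output of one threshold neuron for one input vector."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(samples, epochs=20, rate=1):
    """Adjust weights from (inputs, target) examples using the perceptron rule."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Move the weights toward the target only when the neuron is wrong.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Hypothetical training set: the logical AND of two binary inputs.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_EXAMPLES)
predictions = [predict(weights, bias, inputs) for inputs, _ in AND_EXAMPLES]
print(predictions)
```

Nothing in the code states the AND rule explicitly; the weights are adjusted only from the examples. Real predictors use many such units in layers and are trained on sequence data.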

4 Hidden Markov Models (HMM)
- An HMM is a probabilistic process over a set of states in which the states are “hidden”. Only the outcome is visible to the observer, hence the name Hidden Markov Model.
- HMMs have many uses in genomics:
  - Gene prediction (GENSCAN)
  - Signal peptide prediction (SignalP)
  - Finding periodic patterns
- Used to answer questions like:
  - What is the probability of obtaining a particular outcome?
  - What is the best model from many combinations?
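
The question "what is the probability of obtaining a particular outcome?" is usually answered with the forward algorithm. The sketch below uses a toy two-state model; the state names, the H/P (hydrophobic/polar) alphabet, and every probability are made-up values for illustration, not parameters of SignalP or GENSCAN.

```python
# Hypothetical toy model: a "signal" region followed by a "mature" region.
STATES = ("signal", "mature")
START = {"signal": 0.6, "mature": 0.4}
TRANSITION = {
    "signal": {"signal": 0.7, "mature": 0.3},
    "mature": {"signal": 0.0, "mature": 1.0},
}
EMISSION = {
    "signal": {"H": 0.8, "P": 0.2},  # mostly hydrophobic residues
    "mature": {"H": 0.4, "P": 0.6},
}


def forward_probability(observations):
    """Probability of a non-empty observed string, summed over all hidden paths."""
    # alpha[s] = probability of the symbols seen so far, ending in state s.
    alpha = {s: START[s] * EMISSION[s][observations[0]] for s in STATES}
    for symbol in observations[1:]:
        alpha = {
            s: EMISSION[s][symbol]
            * sum(alpha[prev] * TRANSITION[prev][s] for prev in STATES)
            for s in STATES
        }
    return sum(alpha.values())


print(forward_probability("HHP"))
```

The observer sees only the H/P string; the forward recursion sums over every possible hidden state path without enumerating them, so the cost grows linearly with sequence length.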

5
- The ExPASy (Expert Protein Analysis System) server is dedicated to the analysis of protein sequences and structures.
- Sequence analysis tools include:
  - DNA -> Protein [Translate]
  - Pattern and profile searches
  - Post-translational modification and topology prediction
  - Primary structure analysis
  - Structure prediction (2D and 3D)
  - Alignment
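
As an illustration of what a DNA -> Protein step does, here is a minimal sketch that translates one reading frame with the standard genetic code. It is not the ExPASy Translate implementation; the function name and parameters are assumptions.

```python
# Standard genetic code, listed in the conventional T, C, A, G order
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0, to_stop=True):
    """Translate one forward reading frame (0, 1, or 2) of a DNA or RNA string."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an ambiguous codon
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAGTAA"))
```

A full translation tool would typically also report the three frames of the reverse complement.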

6
- PredictProtein: a service for sequence analysis and structure prediction
- TMpred
- TMHMM: predicts transmembrane helices in proteins (CBS, Denmark)
- big-PI: predicts GPI-anchor sites
- DGPI: predicts GPI-anchor sites
- SignalP: predicts signal peptides
- PSORT: predicts sub-cellular localization
- TargetP: predicts sub-cellular localization
- NetNGlyc: predicts N-glycosylation sites
- PTS1: predicts peroxisomal targeting sequences
- MITOPROT: predicts mitochondrial targeting sequences
- Hydrophobicity
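
The hydrophobicity entry can be illustrated with a sliding-window hydropathy profile. The sketch below assumes the Kyte-Doolittle scale, which the slide does not name, and uses a 19-residue window with a 1.6 cutoff, values often quoted for spotting candidate membrane-spanning segments; treat both as adjustable assumptions. TMHMM and TMpred use more elaborate models than this.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence."""
    # Non-standard residue letters raise KeyError in this sketch.
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def hydrophobic_windows(sequence, window=19, cutoff=1.6):
    """0-based start positions of windows whose mean hydropathy exceeds the cutoff."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, score in enumerate(profile) if score > cutoff]


# Hypothetical test sequence: a leucine stretch flanked by charged residues.
example = "DDDDD" + "L" * 19 + "KKKKK"
profile = hydropathy_profile(example)
print(profile.index(max(profile)))
```

The peak of the profile marks the window that covers the leucine stretch exactly.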

7 Multiple alignment
- Used to do phylogenetic analysis:
  - Same protein from different species
  - Evolutionary relationship: history
- Used to find conserved regions:
  - Local multiple alignment reveals conserved regions
  - Conserved regions are usually key functional regions
  - These regions are prime targets for drug development
  - Protein domains are often conserved across many species
- Algorithm for search of conserved regions:
  - Block Maker
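
Once sequences are aligned, conserved regions can be located by scoring each column. The following is a simplified sketch, not the Block Maker algorithm: it scores a column by the fraction of sequences sharing the most common residue and reports ungapped runs of fully conserved columns. The toy alignment, threshold, and minimum length are assumptions.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences that share the most common residue in each column."""
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))  # gaps count against conservation
    return scores


def conserved_regions(alignment, threshold=1.0, min_length=3):
    """Half-open (start, end) column ranges where every score meets the threshold."""
    scores = column_conservation(alignment)
    regions, start = [], None
    for i, score in enumerate(scores + [-1.0]):  # sentinel closes a trailing run
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    return regions


# Hypothetical alignment of one protein from three species.
ALIGNMENT = [
    "MKTAYIAKQR",
    "MRTAYLAKQK",
    "M-TAYVAKQR",
]
print(conserved_regions(ALIGNMENT))
```

In this toy alignment the blocks TAY and AKQ are identical in all three sequences, while the single conserved M is too short to count as a region.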

8 Multiple alignment tools
- Free programs:
  - Phylip and PAUP
  - Phyml
- The most used websites (T-COFFEE and ClustalW)
- ClustalW:
  - Standard, popular software
  - It aligns two sequences, then keeps adding one new sequence at a time to the alignment
  - Problem: it is only a heuristic, so the result is not guaranteed to be the optimal alignment
- Motif discovery: use your own motif to search databases:
  - PatternFind
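
Searching sequences with your own motif usually means writing the motif in PROSITE-style syntax and scanning for matches. Here is a minimal sketch that converts a simple PROSITE-style pattern into a Python regular expression; it handles only the common elements (`x`, `[...]`, `{...}`, `(n)`, `(n,m)`, `<`, `>`) and is not the PatternFind implementation. The example pattern is the PROSITE N-glycosylation site, N-{P}-[ST]-{P}; the test sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}: anything but P
        element = element.replace("(", "{").replace(")", "}")   # x(2,4): repeat count
        element = element.replace("x", ".")                     # x: any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (1-based position, matched text) for every hit, overlaps included."""
    regex = prosite_to_regex(pattern)
    # A lookahead lets the scan report matches that overlap one another.
    return [
        (match.start() + 1, match.group(1))
        for match in re.finditer(f"(?=({regex}))", sequence)
    ]


hits = find_motif("N-{P}-[ST]-{P}", "MNGTANPSQNASA")
print(hits)
```

The candidate site starting at the second asparagine is rejected because a proline follows it, which is exactly what the `{P}` element excludes.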

9 Phylogenetic analysis
- Phylogenetic trees
  - Describe evolutionary relationships between sequences
- Major modes that drive evolution:
  - Point mutations modify existing sequences
  - Duplications (re-use of existing sequence)
  - Rearrangement
- Two most common methods:
  - Maximum parsimony
  - Maximum likelihood
- The most useful software:

10 Parsimony vs maximum likelihood
- Parsimony is the most popular method; the simplest answer is always the preferred one.
  - It involves evaluating the number of mutations needed to explain the observed data.
  - The best tree is the one that requires the fewest evolutionary changes.
- Likelihood generally performs better than parsimony.
  - In contrast to parsimony, maximum likelihood does not minimize the number of changes. It attempts to answer the question: which tree and parameters of evolutionary events were most likely to produce the current data set?
  - This is computationally difficult to do, and it is typically the slowest of the common methods.
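
The "fewest evolutionary changes" criterion can be illustrated with Fitch's small-parsimony algorithm, which counts the minimum number of substitutions a given tree needs. The slides do not name this algorithm; it is used here as a standard way to score a tree. The species names, sequences, and tree shapes below are hypothetical.

```python
def fitch(node, site_states):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    A tree is a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    if isinstance(node, str):
        return {site_states[node]}, 0
    left_set, left_cost = fitch(node[0], site_states)
    right_set, right_cost = fitch(node[1], site_states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one substitution occurred below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, sequences):
    """Total minimum number of changes over all columns of aligned sequences."""
    n_sites = len(next(iter(sequences.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in sequences.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical aligned sequences and two candidate rooted trees.
SEQUENCES = {"human": "AAG", "chimp": "AAG", "mouse": "ACG", "rat": "ACT"}
TREE_A = (("human", "chimp"), ("mouse", "rat"))
TREE_B = (("human", "mouse"), ("chimp", "rat"))

print(parsimony_score(TREE_A, SEQUENCES), parsimony_score(TREE_B, SEQUENCES))
```

Under parsimony, the tree with the lower score is preferred. A full analysis would also search over many candidate trees, which is where most of the computing time goes.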

11 Definitions
- Homologous: have a common ancestor. Homology cannot be measured; sequences either are homologous or are not.
- Orthologous: the same gene in different species. It is the result of speciation (common ancestor).
- Paralogous: related genes (already diverged) in the same species. It is the result of genomic rearrangement or duplication.

12 Determining protein structure
- Direct measurement of structure
  - X-ray crystallography
  - NMR spectroscopy
  - Site-directed mutagenesis
- Computer modeling
  - Prediction of structure
  - Comparative protein-structure modeling

13 Comparative protein-structure modeling
- Goal: construct a 3-D model of a protein of unknown structure (target), based on similarity of sequence to proteins of known structure (templates).
- Figure legend: blue, predicted model by PROSPECT; red, NMR structure.
- Procedure:
  - Template selection
  - Template–target alignment
  - Model building
  - Model evaluation
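
The template selection step can be sketched as ranking candidate templates by sequence identity to the target. This is a deliberately simplified illustration that assumes each template has already been aligned to the target; real pipelines also weigh alignment coverage, structure quality, and fold-recognition scores. The template names and sequences are made up.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def rank_templates(target, templates):
    """Return (identity, name) pairs, best template first."""
    scored = [(percent_identity(target, seq), name) for name, seq in templates.items()]
    return sorted(scored, reverse=True)


# Hypothetical target and pre-aligned template sequences.
TARGET = "MKTAYIAKQR"
TEMPLATES = {
    "template_A": "MKTAYLAKQR",
    "template_B": "MRS-YVGKQK",
}

ranking = rank_templates(TARGET, TEMPLATES)
print(ranking[0])
```

The top-ranked template would then be carried into the alignment, model-building, and evaluation steps.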

14 The Protein 3-D Database
- The Protein Data Bank (PDB) contains 3-D structural data for proteins.
- Founded in 1971 with a dozen structures.
- As of June 2004, there were 25,760 structures in the database. All structures are reviewed for accuracy and data uniformity.
- Structural data from the PDB can be freely accessed.
- 80% come from X-ray crystallography
- 16% come from NMR
- 2% come from theoretical modeling

15 High-throughput methods

16 Most used websites for 3-D structure prediction
- Protein Homology/analogY Recognition Engine (Phyre)
- PredictProtein
- UCLA Fold Recognition

17 Commercial bioinformatics software
- CLC Genomics Workbench
  - Genomics: 454, Illumina Genome Analyzer, and SOLiD sequencing data; de novo assembly of genomes of any size; advanced visualization, scrolling, and zooming tools; SNP detection using advanced quality filtering
  - Transcriptomics: RNA-seq including paired data and transcript-level expression; small RNA analysis; expression profiling by tags
  - Epigenetics: chromatin immunoprecipitation sequencing (ChIP-seq) analysis; peak finding and peak refinement; graph and table of background distribution; false discovery rate; peak table and annotations
- VectorNTI: sequence analysis and illustration; restriction mapping; recombinant molecule design and cloning; in silico gel electrophoresis; synthetic biology workflows
  - AlignX
  - BioAnnotator
  - ContigExpress
  - GenomBench

18 The bioinformatics not covered in this class
- Comparative genomics and genome browsers
- Genome annotation
- Metagenomics
- Systems biology tools
