Protein Domain Analysis Using Hidden Markov Models Liangjiang (LJ) Wang March 10, 2005 PLPTH 890 Introduction to Genomic Bioinformatics.


Protein Domain Analysis Using Hidden Markov Models Liangjiang (LJ) Wang March 10, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 17

Outline Basic concepts and biological problems. Search for protein domains: –The Pfam database, –Other domain/motif databases. Protein domain modeling: –Hidden Markov Models (HMM), –Construction of the Pfam protein domain models using HMMER.

Biological Problem #1 You identified a new gene, which might be involved in a very interesting biological process. BLAST search in GenBank resulted in a few homologous sequences with unknown function. What else can you do to understand the function of the gene product and/or to localize the possible conserved domain in the protein?

Biological Problem #2 Suppose there is a novel gene identified in mammals, C. elegans and Drosophila, but not yet in plants. This gene is involved in an interesting biological process (e.g., apoptosis). You are interested in finding the orthologous gene in Arabidopsis. However, BLAST search using each of the known sequences failed to identify an Arabidopsis homologue. What else can you try?

Orthologs, Paralogs and Homologs
[Figure: genes X and Y in an ancestral organism; speciation into organisms A and B; duplication; labels X1, X2, Ya, Yb; all grouped as homologs.]
X1 and X2 are orthologs with the same function.
Paralogs Ya and Yb may have different but related functions.

Protein Domains Domains represent evolutionarily conserved amino acid sequences carrying functional and structural information about a protein. Domain analysis helps in understanding the biological function of a gene product. [Figure: example domain, bZIP.]

Protein Domain Analysis Using HMM
[Figure: workflow linking Your Sequence Set, Multiple Sequence Alignment, Hidden Markov Models and HMMER Search.]
>TC50726
AIKLNDVKSCQGTAFWMA
PEVVRGKVKGYGLPADIW
SLGCTVLEMLTGQVPYAP
MECISAMFRIGKGELPPV
PDTLSRDARDFILQCLKV
NPDDRPTAAQLLDHKFVQ
RSFSQSSGSASPHIPRRS
>UFO_ARATH
MDSTVFINNPSLTLPFSY
TFTSSSNSSTTTSTTTDS
SSGQWMDGRIWSKLPPPL
LDRVIAFLPPPAFFRTRC

Comparison of Search Approaches
Approach    Sensitivity   Speed
BLAST       Low           Very fast
HMM         High          Fast
Threading   Very high     Very slow

The Pfam Database Pfam is a database of multiple alignments and hidden Markov models (HMMs) of common conserved protein domains. The alignments use a non-redundant protein set composed of SWISS-PROT and TrEMBL. Pfam consists of parts A and B. Pfam-A contains curated domain families with high-quality alignments. Pfam-B contains families that were generated automatically by clustering the remaining sequences after removal of Pfam-A domains.

Other Domain/Motif Databases ProDom: contains domain families automatically generated from SWISS-PROT and TrEMBL (Pfam-B). SMART: Simple Modular Architecture Research Tool; contains domain families that are widely represented among nuclear, signaling and extracellular proteins. TIGRFAMs: a collection of manually curated protein families represented as hidden Markov models; contains models of full-length proteins and shorter protein regions.

More Domain/Motif Databases PROSITE: consists of biologically significant sites, patterns and profiles; uses regular expressions to represent most patterns. PRINTS: a collection of protein fingerprints (conserved motifs, ungapped alignments), which may be used to assign new sequences to known protein families. Blocks: consists of short ungapped alignments corresponding to the most highly conserved regions of proteins.

Even More Domain/Motif Databases InterPro: an integrated and curated collection of protein families, domains and motifs from PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs. CDD: contains domains derived from Pfam, SMART and models curated at NCBI. 3Dee: contains structural domain definitions for all protein chains in the Protein Data Bank (PDB); clustered by both sequence and structural similarity.

Why So Many Domain/Motif Databases? Different representations of patterns: –PROSITE: regular expression. –ProDom: multiple alignment and consensus. –Pfam: multiple alignment and HMM. Different approaches or focuses: –SMART: focused on signaling proteins. –PRINTS and Blocks: highly conserved segments. –3Dee: structural domain definitions. “Meta-sites” (databases of databases): –InterPro: an integrated collection, derived from several domain/motif databases.

Protein Domain Modeling Machine learning concepts. Hidden Markov Models (HMM). HMMER (a software tool for constructing and searching HMM). Construction of the Pfam protein domain models.

Machine Learning The study of computer algorithms that automatically improve performance through experience. In practice, this means: we have a set of examples from which we want to extract some rules (regularities) using computers. Two types of machine learning: –Supervised: learn with a teacher (using a set of input-output training examples). –Unsupervised: let the machine explore the data space and find some interesting patterns.

Learning from Examples Learning refers to the process in which a model is generalized (induced) from given examples (training dataset). Error-correction learning: for each of the given examples, a computer program –makes a prediction based on what was already learned (i.e., model parameters). –compares the prediction with the given output to calculate the error. –adjusts the model parameters in some way (learning algorithm) to minimize the error.
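The predict, compare, adjust loop described above can be illustrated with a very small example. The sketch below uses a perceptron learning the logical AND function; the perceptron, the AND dataset and the names `predict` and `train_perceptron` are illustrative choices made here, not something taken from the lecture.

```python
def predict(weights, bias, inputs):
    """Make a prediction from the current model parameters (step 1)."""
    score = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if score > 0 else 0


def train_perceptron(examples, epochs=100, learning_rate=1.0):
    """Error-correction learning on (inputs, target) training examples."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for inputs, target in examples:
            # Step 2: compare the prediction with the given output.
            error = target - predict(weights, bias, inputs)
            if error != 0:
                mistakes += 1
                # Step 3: adjust the parameters to reduce the error.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:  # every training example is predicted correctly
            break
    return weights, bias


# Hypothetical training set: the logical AND of two binary inputs.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_EXAMPLES)
for inputs, target in AND_EXAMPLES:
    print(inputs, "target:", target, "predicted:", predict(weights, bias, inputs))
```

The same three steps apply to richer models; only the prediction function and the update rule change.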

Common Pitfalls - Training Dataset
[Figure: three panels showing the data space and the data instances sampled: too few examples (overfitting); sampling problem; good.]
“Garbage in, garbage out”

Hidden Markov Model (HMM) A class of probabilistic models that are generally applicable to time series or linear sequences. Widely used in speech recognition since the early 1970s. David Haussler’s group at UC Santa Cruz introduced HMMs for biological sequence profiles. An HMM turns a multiple alignment into a position-specific scoring system that can be used to search for remotely homologous sequences.

The Occasionally Dishonest Casino Problem
The casino has two dice: a fair die and a loaded die. It uses the fair die most of the time, but occasionally (P = 0.05) switches to the loaded die, and may switch back to the fair die with probability 0.1. The loaded die has probability 0.5 of a six and probability 0.1 for each of the numbers one to five. The fair die has probability 1/6 for each number.
[Figure: a sequence of rolls (symbols) shown above the die used for each roll (state/path).]
Die (State/Path): FFFFFFFFLLLLLLLLLLFFFFFF
The state sequence or path is hidden (hence the name HMM).
Transition probabilities: P(L|F) = 0.05; P(F|F) = 0.95
Emission probabilities: P(6|L) = 0.5; P(6|F) = 1/6

An HMM for the Casino Problem
[Figure: two states, Fair and Loaded, connected by transition probabilities; each state lists its emission probabilities.]
Emission probability   1      2      3      4      5      6
Fair                   1/6    1/6    1/6    1/6    1/6    1/6
Loaded                 1/10   1/10   1/10   1/10   1/10   1/2
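To make the casino model concrete, here is a minimal sketch that writes the transition and emission probabilities from the slides as Python dictionaries and generates rolls from them. Starting with the fair die, the fixed random seed and the name `simulate` are assumptions made for this illustration.

```python
import random

# Parameters taken from the casino slides.
TRANS = {
    "F": {"F": 0.95, "L": 0.05},  # fair die: occasionally switch to loaded
    "L": {"F": 0.10, "L": 0.90},  # loaded die: switch back with P = 0.1
}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def simulate(n_rolls, seed=0, start="F"):
    """Generate (rolls, path); `start` is an assumed initial state."""
    rng = random.Random(seed)
    state = start
    rolls, path = [], []
    for _ in range(n_rolls):
        faces = list(EMIT[state])
        # Emit a roll from the current (hidden) state.
        rolls.append(rng.choices(faces, weights=[EMIT[state][f] for f in faces])[0])
        path.append(state)
        # Move to the next hidden state.
        options = list(TRANS[state])
        state = rng.choices(options, weights=[TRANS[state][s] for s in options])[0]
    return rolls, "".join(path)


rolls, path = simulate(24)
print("Rolls:", "".join(str(r) for r in rolls))
print("Die:  ", path)
```

An observer of the casino sees only the first printed line; the second line is the hidden state path.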

An HMM for 5’ Splice Site Recognition (Eddy, 2004) States: E – Exon 5 – 5’ splice site I – Intron An observation (nucleotide sequence) corresponds to a state path (or paths) through the HMM.

Finding the Best Hidden State Path (Eddy, 2004) The probability P of a state path, given the model and an observation (sequence), is the product of all the emission and transition probabilities along the path.

Calculating the Probability of a State Path
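As a worked illustration of multiplying emission and transition probabilities along a path, the sketch below uses the dishonest casino parameters from the earlier slides rather than the splice-site model of the figure. The uniform starting distribution `START` and the function name `path_probability` are assumptions introduced here.

```python
# Casino parameters from the earlier slides.
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}
START = {"F": 0.5, "L": 0.5}  # assumption: either die is equally likely first


def path_probability(rolls, path):
    """Joint probability of the observed rolls and one given state path."""
    if len(rolls) != len(path):
        raise ValueError("rolls and path must have the same length")
    prob = START[path[0]] * EMIT[path[0]][rolls[0]]
    for prev, state, roll in zip(path, path[1:], rolls[1:]):
        prob *= TRANS[prev][state] * EMIT[state][roll]
    return prob


# 0.5 * P(6|L) * P(L|L) * P(6|L) * P(F|L) * P(1|F)
# = 0.5 * 0.5 * 0.9 * 0.5 * 0.1 * (1/6) = 0.001875
print(path_probability([6, 6, 1], "LLF"))
```

For long sequences the product underflows, so practical implementations usually sum log probabilities instead.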

How to Model a Protein Domain?
Consider a two-state HMM: Is there a domain X (Yes/No)?
A.A. (Symbol):      EDQILIKARNTEAARRSRVIANYL
DomX? (State/Path): NNNNNNNNYYYYYYYYYYNNNNNN
Is this sufficient for modeling a protein domain? No.
Seq1 KGIQEF--GADWYKVAK--NVGNKSPEQCILRFLQ
Seq2 ALVKKHGQG-EWKTIAS--NLNNRTEQQCQHRWLR
Seq3 SGVRKYGEG-NWSKILLHYKFNNRTSVMLKDRWRT
How to represent position-dependent amino acid distribution? What about insertions and deletions?

An HMM for Protein Domain Recognition (Eddy, 1996) States: M - match D - delete I - insert

HMM Parameterization (Training) HMM parameters are estimated from the multiple sequence alignment. –Basic: maximum likelihood estimation. –Advanced: the MAP construction algorithm. (See Durbin et al., Biological Sequence Analysis.) A high-quality alignment is essential for model construction. This includes careful selection of sequences and manual editing of the multiple sequence alignment generated by the ClustalW program.
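A minimal sketch of the basic maximum likelihood idea follows: match-state emission probabilities are estimated as the relative frequencies of amino acids in one alignment column. It uses the three-sequence alignment from the earlier slide and treats every column as a match column, which is a simplification. The add-one pseudocount option is an assumption added here to avoid zero probabilities; it is not how HMMER sets its priors.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Alignment from the "How to Model a Protein Domain?" slide.
ALIGNMENT = [
    "KGIQEF--GADWYKVAK--NVGNKSPEQCILRFLQ",
    "ALVKKHGQG-EWKTIAS--NLNNRTEQQCQHRWLR",
    "SGVRKYGEG-NWSKILLHYKFNNRTSVMLKDRWRT",
]


def column_emissions(alignment, column, pseudocount=0.0):
    """Estimate emission probabilities for one alignment column.

    With pseudocount=0 this is the plain maximum likelihood estimate
    (observed frequency); gap characters are ignored. A column made only
    of gaps needs pseudocount > 0 to avoid dividing by zero.
    """
    counts = Counter(seq[column] for seq in alignment if seq[column] != "-")
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


# Column 11 (0-based) holds W in all three sequences.
ml = column_emissions(ALIGNMENT, 11)
smoothed = column_emissions(ALIGNMENT, 11, pseudocount=1.0)
print("ML estimate       P(W) =", ml["W"])
print("With pseudocounts P(W) =", smoothed["W"])
```

With only three sequences the plain estimate assigns probability zero to every unseen amino acid, which is one reason small seed alignments need some form of prior.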

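Scoring a Sequence with an HMM The main task is to find the hidden state path with the highest probability, given the model and an observation (sequence). –The Viterbi algorithm (dynamic programming) finds this most probable path. –The forward algorithm sums the probability of the sequence over all possible state paths. –The backward algorithm, combined with the forward algorithm, gives the posterior probability of each state at each position. (See Durbin et al., Biological Sequence Analysis, p.55-61)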
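The Viterbi and forward recursions can be sketched on the two-state casino model from the earlier slides, which is much smaller than a profile HMM but uses the same dynamic programming. The uniform starting distribution `START` is an assumption, and the function names are chosen here for illustration.

```python
import math

STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # assumption: either die is equally likely first
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def viterbi(rolls):
    """Return (most probable state path, its log probability)."""
    # best[s] = log probability of the best path that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    pointers = []
    for roll in rolls[1:]:
        step, updated = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            step[s] = prev
            updated[s] = best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][roll])
        pointers.append(step)
        best = updated
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: best[s])
    path = [last]
    for step in reversed(pointers):
        path.append(step[path[-1]])
    return "".join(reversed(path)), best[last]


def forward(rolls):
    """Probability of the rolls summed over all state paths."""
    f = {s: START[s] * EMIT[s][rolls[0]] for s in STATES}
    for roll in rolls[1:]:
        f = {s: EMIT[s][roll] * sum(f[p] * TRANS[p][s] for p in STATES)
             for s in STATES}
    return sum(f.values())


rolls = [1, 2, 3, 4, 5] * 2 + [6] * 10
path, log_prob = viterbi(rolls)
print("Best path:", path)
print("Forward probability:", forward(rolls))
```

Viterbi uses log probabilities to avoid underflow; the forward function here works with raw probabilities, which is adequate only for short sequences.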

HMM versus PWM Advantages: –An HMM has position-dependent amino acid distributions, which are represented as emission probabilities at each match state (also true of a PWM). –Insertion/deletion gap penalties are handled using transition probabilities (usually not possible with a PWM). –The possible dependence of an amino acid on its preceding neighbor can be represented using the transition probabilities (not possible with a PWM). Problems: –Long-range interactions between amino acids are not modeled. –Multiple sequence alignments are required.

HMMER A software package for constructing and searching HMMs. Source code and binary distributions for various platforms (UNIX, Linux and Macintosh PowerPC) are available; follow the detailed User’s Guide for software installation. Multiple sequence alignment: ClustalW or ClustalX (with Windows interface), available at ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/. Input sequences are in FASTA format.

HMMER Programs hmmbuild: build a model from a multiple sequence alignment. hmmalign: align multiple sequences to an HMM. hmmcalibrate: determine appropriate statistical significance parameters for an HMM prior to database searches. hmmsearch: search a sequence database with an HMM. hmmpfam: search an HMM database with one or more sequences. hmmconvert and hmmindex.

Construction of the Pfam HMMs
[Figure: flowchart]
Family definition (sources: PROSITE, literature)
-> Seed alignment (ClustalW, editing; representative, stable)
-> HMM profile (hmmbuild)
-> Full alignment (hmmalign; complete, volatile)
Iterate if the HMM doesn’t find all members.

A Solution to Problem #2
Collect known sequences from the literature.
Do a multiple alignment (ClustalX, editing).
Create an HMM profile using hmmbuild.
Search an Arabidopsis sequence dataset using the HMM and hmmsearch.
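The HMMER steps of this solution could be scripted roughly as follows. This is a sketch that only assembles the command lines. The file names are hypothetical, the hmmcalibrate step is taken from the HMMER Programs slide rather than from the list above, and the argument order (model file before alignment or database) follows the HMMER 2 conventions as recalled here, so it should be checked against the User's Guide for the installed version.

```python
import shlex


def hmmer_workflow(alignment, hmm_file, sequence_db):
    """Return the HMMER commands, in order, as argument lists."""
    return [
        ["hmmbuild", hmm_file, alignment],     # build the model from the alignment
        ["hmmcalibrate", hmm_file],            # calibrate significance statistics
        ["hmmsearch", hmm_file, sequence_db],  # search the sequence dataset
    ]


# Hypothetical file names, for illustration only.
commands = hmmer_workflow("family.aln", "family.hmm", "arabidopsis_proteins.fasta")
for cmd in commands:
    print(shlex.join(cmd))

# To actually run them (requires HMMER on the PATH):
#   import subprocess
#   for cmd in commands:
#       subprocess.run(cmd, check=True)
```

Keeping the commands as argument lists, rather than one shell string, avoids quoting problems if a file name contains spaces.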

Other Tools for Protein Pattern Analysis
SignalP: for predicting signal peptides and cleavage sites.
PSORT: for predicting protein localization sites in cells.
TMHMM: for predicting transmembrane segments.

Summary The hidden Markov model (HMM) is well suited to representing protein domains. Because HMMs are constructed from aligned sequence families, an HMM search is often more sensitive than BLAST for detecting remotely related homologues. Resources are available for modeling and searching for protein domains/motifs.

PROSITE vs. Perl RegExp
PDOC00269 (Heat shock hsp70 signature)
  PROSITE: [IV]-D-L-G-T-[ST]-x-[SC]
  Perl:    [IV]DLGT[ST]\w[SC]
PDOC50884 (Part of Zinc finger Dof-type signature)
  PROSITE: C-x(2)-C-x(7)-[CS]-x(13)-C-x(2)-C
  Perl:    C\w{2}C\w{7}[CS]\w{13}C\w{2}C
PDOC00081 (Part of Cytochrome P450 signature)
  PROSITE: [FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C
  Perl:    [FW][SGNH]\w[GD][^F][RKHPT][^P]C
PDOC00036 (Part of bZIP domain signature)
  PROSITE: [KR]-x(1,3)-[RKSAQ]-N-{VL}-x-[SAQ](2)-{L}
  Perl:    [KR]\w{1,3}[RKSAQ]N[^VL]\w[SAQ]{2}[^L]
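The translation rules implied by these pairs can be sketched as a small converter. The slide uses Perl; Python's `re` module accepts the same syntax for these constructs. The function name `prosite_to_regex` is hypothetical, `x` is mapped to `\w` as on the slide (which also matches digits and underscores), and PROSITE's `<` and `>` anchors are not handled.

```python
import re

# One PROSITE element: x, a residue, [allowed], or {excluded},
# optionally followed by a repeat count such as (2) or (1,3).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a regular expression string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            core = r"\w"                       # any residue, as on the slide
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # excluded residues
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("[IV]-D-L-G-T-[ST]-x-[SC]")
print(regex)
# Hypothetical test sequence containing one match.
hit = re.search(regex, "AAIDLGTTYSCAA")
print(hit.group() if hit else "no match")
```

Applied to the four PROSITE patterns above, the converter reproduces the Perl expressions listed with them.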