Applications of HMMs. Yves Moreau, 2003-2004.

Overview
Profile HMMs: estimation, database search, alignment
Gene finding: elements of gene prediction, prokaryotes vs. eukaryotes, gene prediction by homology, GENSCAN

Profile HMM
Hidden Markov model for the modeling of protein families and for multiple alignment.
Example: part of the alignment of the SH3 domain, with two conserved regions separated by a variable region.
    GGWWRGdy.ggkkqLWFPSNYV
    IGWLNGynettgerGDFPGTYV
    PNWWEGql..nnrrGIFPSNYV
    DEWWQArr..deqiGIVPSK--
    GEWWKAqs..tgqeGFIPFNFV
    GDWWLArs..sgqtGYIPSNYV
    GDWWDAel..kgrrGKVPSNYL
    -DWWEArslssghrGYVPSNYV
    GDWWYArslitnseGYIPSTYV
    GEWWKArslatrkeGYIPSNYV
    GDWWLArslvtgreGYVPSNFV
    GEWWKAkslsskreGFIPSNYV
    GEWCEAgt.kngq.GWVPSNYI
    SDWWRVvnlttrqeGLIPLNFV
    LPWWRArd.kngqeGYIPSNYI
    RDWWEFrsktvytpGYYESGYV
    EHWWKVkd.algnvGYIPSNYV
    IHWWRVqd.rngheGYVPSSYL
    KDWWKVev..ndrqGFVPAAYV

Profile HMMs
Hidden Markov models for multiple alignments, built from match, insert, and delete states.
Figure labels: Bgn, End, Match, Insertion, Deletion.

Silent deletion states
Deletions could be modeled by shortcut jumps between states.
Problem: the number of transitions grows quadratically.
Other solution: use parallel states that do not produce any symbol (silent states).
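The difference in growth can be made concrete with a small count. This is only a sketch under stated assumptions: insert states are ignored, the model has Begin, L match states, and End, and the function names are illustrative and do not come from the slides.

```python
def shortcut_transitions(num_match: int) -> int:
    """Transitions if every state may jump directly to every later state.

    States are Begin, M_1..M_L, End: L + 2 states, one transition per
    (earlier, later) pair, so the count grows quadratically in L.
    """
    n = num_match + 2
    return n * (n - 1) // 2


def silent_state_transitions(num_match: int) -> int:
    """Transitions if deletions go through silent states D_1..D_L instead."""
    L = num_match
    main_chain = L + 1      # Begin -> M_1 -> ... -> M_L -> End
    into_delete = L         # previous main state -> D_i
    delete_chain = L - 1    # D_i -> D_(i+1)
    out_of_delete = L       # D_i -> next main state
    return main_chain + into_delete + delete_chain + out_of_delete  # = 4L


for L in (10, 100, 1000):
    print(L, shortcut_transitions(L), silent_state_transitions(L))
```

Under these assumptions the shortcut design needs (L+2)(L+1)/2 transitions, while the silent-state design needs 4L.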

HMM from multiple alignment
Input: the SH3 multiple alignment shown earlier, with its conserved columns marked.
Parameter estimation = estimation with known paths.
Output: the corresponding profile HMM (figure not reproduced; one transition is labeled .85).

Pseudocounts
Zero probabilities in the HMM cause the rejection of sequences containing previously unseen residues.
To avoid this problem, add pseudocounts (extra counts, as if prior data were available).
New profile HMM (figure not reproduced; labeled probabilities: .85, .33).
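As a minimal sketch of the idea, the emission probabilities of one match state can be estimated from an alignment column with add-one (Laplace) pseudocounts. The function name and the choice of a uniform pseudocount are assumptions made for illustration; the slides do not say which pseudocount scheme was used.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def emission_probs(column: str, pseudocount: float = 1.0) -> dict:
    """Estimate match-state emission probabilities from one alignment column.

    Gap characters ('-', '.') are skipped. With pseudocount=0 this is the
    plain maximum-likelihood estimate, which assigns probability zero to
    every residue that was never observed in the column.
    """
    counts = Counter(c for c in column.upper() if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


# Third column of the SH3 alignment: W in all 19 sequences.
column = "W" * 19
print(emission_probs(column, pseudocount=0.0)["A"])  # unseen residue gets 0
print(emission_probs(column, pseudocount=1.0)["A"])  # small but non-zero
```

With the pseudocount, a new family member carrying a different residue in that column is penalized rather than rejected outright.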

Database search with profile HMM
The estimated model can be used to detect new members of the protein family in a sequence database (more sensitive than PSI-BLAST).
For each sequence x in the database, we compute P(x, π* | M) (Viterbi) or P(x | M) (forward algorithm).
In practice we work with log-odds scores relative to the random model, log [P(x | M) / P(x | R)].

Alignment to profile HMM
Through Viterbi decoding (the search for the best alignment path), we can align sequences to a profile HMM.
Figure panels: training sequences; database matches.

Multiple alignment with profile HMM
If the sequences are not aligned, it is possible to train a profile HMM to align them.
Initialization: choose the length of the profile HMM (the number of match states), roughly the sequence length.
Training: estimate the model via Viterbi training or Baum-Welch training, with heuristics to avoid local optima.
Multiple alignment: use Viterbi decoding to align the sequences.

Extensions
More sophisticated pseudocounts are possible (Dirichlet mixtures).
Different types of local alignments can be done with HMMs.
Methods are available to weight sequences according to their evolutionary distances.

Protein families PFAM Collection of protein families and protein domains Provides multiple alignment of the protein families for the domains Provides the domain organization of proteins Provides profile HMMs of the domains

Software for profile HMMs
SAM: University of California, Santa Cruz. Web service: apps/HMM-applications.html (takes time).
HMMER (‘hammer’): Washington University, St. Louis.

Gene finding

Overview Elements of gene prediction Prokaryotes vs. eukaryotes Gene prediction by homology GENSCAN

DNA makes RNA makes proteins

Evidence for gene prediction
Sources of evidence (positive and negative):
Sequence similarity to known genes (e.g., found by BLASTX)
Statistical measures of codon bias
Template matches to functional sites (e.g., splice sites)
Similarity to features not likely to overlap coding sequence (e.g., Alu repeats)
The structure must respect the biological grammar (promoter, exon, intron, ...)

Search by signal vs. search by content Search by signal Detect short signals in the genome E.g., splice site, signal peptide, glycosylation site Neural networks can be useful here Search by content Detect extended regions in the genome e.g., coding regions, CpG islands Hidden Markov Models are useful here Gene finding algorithms combine both

Probabilistic prediction vs. homology Hidden Markov Models can be used to predict genes Homology to a known gene is also a strong method for detecting genes More and more gene prediction packages combine both approaches

Search by signal vs. content

Signals in prokaryotes
Transcription start and stop, -35 region, TATA box, translation start and stop, open reading frames, Shine-Dalgarno motif, start ATG/GTG, stop TAA/TAG/TGA, stem-loops, operon.

Problems for prokaryotes Short genes are hard to detect Operons Overlapping genes

Signals in eukaryotes
Transcription: promoter/enhancer/silencer, TATA box
Introns/exons: donor/acceptor/branch
PolyA
Repeats: Alu, satellites
CpG islands
Cap, CCAAT and GC boxes
Translation: 5’ and 3’ UTR, Kozak consensus, start ATG, stop TAA/TAG/TGA

Open reading frames Translate the sequence into the six possible reading frames Check for start and stop codons
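A minimal sketch of the six-frame scan is given below. It checks codons directly rather than translating to amino acids first, which finds the same start and stop positions. The function names, the default of ATG as the only start codon, and the `min_codons` threshold are assumptions made for illustration. Passing `starts=("ATG", "GTG")` would also accept the alternative prokaryotic start codon.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, starts=("ATG",), min_codons: int = 2):
    """Scan all six reading frames for start...stop open reading frames.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and refer to the strand that was scanned (for '-', that
    is the reverse complement). min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in starts:
                    start = i                      # first start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```

ORFs that run off the end of the sequence without a stop codon are not reported in this sketch.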

Codon bias In coding sequences, genomes have specific biases for the use of codons encoding the same amino acid

Coding potential
Most coding potentials are based on analysis of codon usage.
The HMM keeps track of some kind of average coding potential around each position.
The increase and decrease of the coding potential will “push” the HMM in and out of the exons.
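One simple way to picture a codon-usage-based coding potential is an average log-odds score per codon. The sketch below assumes a uniform 1/64 background, add-one pseudocounts, and a made-up training sequence. It is a deliberate simplification and should not be read as the scoring scheme of any particular gene finder.

```python
import math
from collections import Counter

BASES = "ACGT"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]


def codon_frequencies(coding_seqs, pseudocount: float = 1.0) -> dict:
    """Estimate codon frequencies from in-frame coding sequences."""
    counts = Counter()
    for s in coding_seqs:
        s = s.upper()
        counts.update(s[i:i + 3] for i in range(0, len(s) - 2, 3))
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_potential(window: str, freqs: dict) -> float:
    """Average log2-odds per codon against a uniform 1/64 background."""
    window = window.upper()
    codons = [window[i:i + 3] for i in range(0, len(window) - 2, 3)]
    codons = [c for c in codons if c in freqs]   # skip codons with N etc.
    if not codons:
        return 0.0
    return sum(math.log2(freqs[c] * 64) for c in codons) / len(codons)


# Hypothetical training data with a strong preference for GCC and AAG.
freqs = codon_frequencies(["GCCGCCGCCAAG" * 5])
print(coding_potential("GCCGCC", freqs))   # positive: looks like the training set
print(coding_potential("TTTTTT", freqs))   # negative: unlike the training set
```

Sliding such a score along the sequence gives the rising and falling profile that the slide describes.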

Promoter region
The promoter region contains the elements that control the expression of the gene.
Prediction of the promoter region (e.g., prediction of the TATA box) is difficult.

Intron-exon splicing
Consensus sequences:
5’ donor: (A,C)AG/GT(A,G)AGT
3’ acceptor: TTTTTNCAG/GCCCCC
Branch: CT(G,A)A(C,T)
Neural networks can predict splice sites; they can detect complex correlations between positions in a functional site.
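Read literally, the donor consensus (A,C)AG/GT(A,G)AGT is a pattern that can be scanned for with a regular expression. The sketch below does exactly that and nothing more; the function name is illustrative. Real donor sites often deviate from the consensus, so an exact match like this would miss many true sites, which is why weight matrices and neural networks are used instead.

```python
import re

# Exon side [AC]AG, then the intron side GT[AG]AGT. The lookahead makes the
# match zero-width so that overlapping candidates are all reported.
DONOR = re.compile(r"(?=([AC]AG)(GT[AG]AGT))")


def donor_sites(seq: str):
    """Return 0-based positions of the first intron base (the G of GT)."""
    return [m.start() + 3 for m in DONOR.finditer(seq.upper())]


print(donor_sites("TTCAGGTAAGTCC"))
```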

Gene prediction by homology

Coding regions evolve more slowly than noncoding ones (conserved by natural selection because of their functional role) Not only the protein sequence but also the gene structure can be conserved Use standard homology methods Gene syntax must be respected

Gene prediction by homology

Procrustes
Find potentially related sequences with BLASTX (= model sequences).
Find all possible blocks (exons) on the basis of acceptor/donor locations.
Check which blocks can be aligned with the model sequences.
Look for the best alignment of blocks with the query sequence.

Gene prediction by homology Advantages Recognition of short exons and atypical exons Correct assembly of complex genes (> 10 exons) Disadvantages Genes without known homologs are missed Good homologs necessary for the prediction of the gene structure Very sensitive to sequencing errors

GENSCAN

GENSCAN was used for the annotation of the human genome in the Human Genome Project.
Gene prediction with hidden semi-Markov models.
Different models depending on GC content (<43%, 43-51%, 51-57%, >57%).

Typical gene structure

Signal: human splice site 5’ splice site 3’ splice site

Hidden semi-Markov model

Example Nodes of HSMM Position-weight matrix (signal) Higher-order position-weight matrix HMM (content)
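A position-weight matrix for a signal can be sketched as per-column base frequencies scored as log-odds against a background. The example sites below are invented for illustration, as are the function names, the add-one pseudocount, and the uniform 0.25 background. The sketch covers only the basic (zeroth-order) matrix, not the higher-order variant.

```python
import math


def build_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Per-position base frequencies from equal-length aligned sites."""
    pwm = []
    for i in range(len(sites[0])):
        col = [s[i] for s in sites]
        total = len(col) + pseudocount * len(alphabet)
        pwm.append({a: (col.count(a) + pseudocount) / total for a in alphabet})
    return pwm


def pwm_log_odds(pwm, window, background=0.25):
    """Sum of log2(p_position(base) / background) over the window."""
    return sum(math.log2(col[b] / background) for col, b in zip(pwm, window))


def best_hit(pwm, seq):
    """Start index of the highest-scoring window in seq."""
    n = len(pwm)
    return max(range(len(seq) - n + 1),
               key=lambda i: pwm_log_odds(pwm, seq[i:i + n]))


# Hypothetical aligned donor-like sites (not taken from the slides).
sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGT", "AAGGTAAGA"]
pwm = build_pwm(sites)
print(pwm_log_odds(pwm, "CAGGTAAGT"))      # high score for a consensus-like site
print(best_hit(pwm, "TTTTCAGGTAAGTTTT"))   # index of the best window
```

Unlike an exact consensus match, the matrix still gives a graded score to sites with one or two mismatches.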

Architecture of GENSCAN

Training of HSMM Viterbi algorithm Viterbi algorithm for HSMMs

Gene structure prediction Current performance on exon prediction is acceptable However, grouping the correct exons into the genes is still problematic In many cases, a significant proportion of the predicted genes will not be correct

CpG islands
In mammals, CpG islands have higher G+C and CG dinucleotide content than the rest of the DNA.
CpG islands arise in active regions where no deactivation by methylation takes place (CG dinucleotides in methylated regions disappear by deamination).
CpG islands may be used as gene markers in mammals.
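Both properties named above can be measured on a sequence window. The sketch below computes the G+C fraction and the observed/expected ratio of CG dinucleotides, where the expected count assumes C and G occur independently. The function name is illustrative and any thresholds for calling an island are not taken from the slides.

```python
def cpg_stats(seq: str):
    """Return (G+C fraction, observed/expected CpG ratio) for a window.

    The expected CG count is (#C * #G) / N, i.e. what independent C and G
    frequencies would give; the ratio is 0.0 when that is undefined.
    """
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0
    ratio = observed / expected if expected else 0.0
    return gc_fraction, ratio


print(cpg_stats("CGCGCGCG"))   # CG-rich window
print(cpg_stats("ATATATAT"))   # no C or G at all
```

Applied in a sliding window, a region where both values stay high is a candidate CpG island. Methylated regions, which have lost CG dinucleotides by deamination, show a ratio well below 1.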

Repeats Repeats make up a large part of the human genome Alu repeats Long Interspersed Elements (LINEs) Short Interspersed Elements (SINEs) Important to mask repeats when searching for genes

Promoter, enhancers, and silencers

Promoter, enhancers, and silencers

Polyadenylation signal
Polyadenylation (cleavage of the pre-mRNA 3' end and synthesis of the poly-(A) tract) is a very important early step of pre-mRNA processing.
The best-known signal involved in this process is AATAAA, located upstream of the poly-(A) site (the site of cleavage).
Real signals can differ from the AATAAA consensus sequence; the most frequent natural variant, ATTAAA, is nearly as active as the canonical sequence.

Problem: alternative splicing

Problem: pseudogenes Loss of promoter, extra stop codon, frameshift Translocation, duplication

Problem: RNA genes
rRNA (ribosomal), tRNA (transfer), snRNA (splicing), telomerase RNA, microRNAs

Neural networks for exon prediction GRAIL uses a neural network to predict the score of a candidate exon