Eukaryotic Gene Finding

Eukaryotic Gene Finding Adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/

Prokaryotic vs. Eukaryotic Genes
Prokaryotes: small genomes; high gene density; no introns (or splicing); no RNA processing; similar promoters; overlapping genes.
Eukaryotes: large genomes; low gene density; introns (splicing); RNA processing; heterogeneous promoters; polyadenylation.

Pre-mRNA Splicing
[Figure: assembly of the spliceosome and catalysis. Recoverable labels: U1 snRNP, U2 snRNP, U2AF65, SR proteins, 5' splice signal, 3' splice signal, polyY tract, branch signal, exonic enhancers, exonic repressor, intronic enhancers, intronic repressor, intron definition, exon definition.]

Some Statistics
On average, a vertebrate gene is about 30 kb long.
The coding region takes up about 1 kb.
Exon sizes vary from tens of base pairs to kilobases.
An average 5' UTR is about 750 bp and an average 3' UTR is about 450 bp, but both can be much longer.

Human Splice Signal Motifs

Semi-Markov HMM Model

GHMM
A finite set Q of states
An initial state distribution Π
Transition probabilities T_{i,j} for i, j ∈ Q
A length distribution f for the states (f_q is the length distribution of state q)
A probability model for each state

GHMM (continued)
A parse Φ of a sequence S of length L is an ordered sequence of states (q_1, ..., q_t) with a duration d_i associated with each state q_i, where the durations sum to L.
The most probable parse Φ_opt can be computed with a Viterbi-style algorithm.
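The recursion behind the most probable parse can be sketched as follows. This is a minimal illustration in log space, not GENSCAN's implementation: the function `ghmm_viterbi`, its parameters, and the two-state toy model (noncoding `N`, coding `C` with durations restricted to multiples of 3) are all assumptions made up for this example.

```python
import math

NEG_INF = -math.inf

def safe_log(p):
    """log(p), with log(0) mapped to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def ghmm_viterbi(seq, states, log_init, log_trans, log_dur, log_emit, max_dur):
    """Return (log-probability, parse); parse is a list of (state, duration)."""
    L = len(seq)
    # best[t][q]: best score of a parse of seq[:t] whose last segment is in state q
    best = [{q: NEG_INF for q in states} for _ in range(L + 1)]
    back = [{q: None for q in states} for _ in range(L + 1)]
    for t in range(1, L + 1):
        for q in states:
            for d in range(1, min(max_dur, t) + 1):
                seg = log_dur(q, d) + log_emit(q, seq[t - d:t])
                if seg == NEG_INF:
                    continue
                start = t - d
                if start == 0:  # first segment of the parse
                    cands = [(log_init[q] + seg, None)]
                else:           # extend the best parse ending at `start`
                    cands = [(best[start][p] + log_trans[p][q] + seg, p)
                             for p in states]
                score, prev = max(cands, key=lambda c: c[0])
                if score > best[t][q]:
                    best[t][q] = score
                    back[t][q] = (d, prev)
    q = max(states, key=lambda s: best[L][s])
    score = best[L][q]
    if score == NEG_INF:
        raise ValueError("no parse with non-zero probability")
    parse, t = [], L
    while t > 0:  # trace the segments back from the end of the sequence
        d, prev = back[t][q]
        parse.append((q, d))
        t -= d
        q = prev
    parse.reverse()
    return score, parse

# Toy model (hypothetical numbers): N prefers A/T, C prefers C/G.
STATES = ["N", "C"]
LOG_INIT = {"N": safe_log(0.5), "C": safe_log(0.5)}
# No self-transitions: a state's length comes from its duration model.
LOG_TRANS = {"N": {"N": NEG_INF, "C": 0.0}, "C": {"N": 0.0, "C": NEG_INF}}
EMIT = {"N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
        "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}
MAX_DUR = 9

def log_dur(q, d):
    if q == "C":  # coding segments must be a whole number of codons
        return safe_log(1 / 3) if d in (3, 6, 9) else NEG_INF
    return safe_log(1 / MAX_DUR)

def log_emit(q, sub):
    return sum(safe_log(EMIT[q][b]) for b in sub)

score, parse = ghmm_viterbi("ATATATGCGCGCTATA", STATES, LOG_INIT, LOG_TRANS,
                            log_dur, log_emit, MAX_DUR)
print(parse)
```

On this toy input the parse is N for 6 bases, C for 6 bases, then N for 4 bases. The inner loop over durations is what distinguishes this from the ordinary Viterbi algorithm, and it is also why the running time grows with the maximum allowed duration.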

Genscan HSMM

GenScan States
N: intergenic region
P: promoter
F: 5' untranslated region
Esngl: single exon (intronless), from translation start to stop codon
Einit: initial exon, from translation start to donor splice site
Ek: phase k internal exon, from acceptor splice site to donor splice site
Eterm: terminal exon, from acceptor splice site to stop codon
Ik: phase k intron (0: between codons; 1: after the first base of a codon; 2: after the second base of a codon)
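The intron phase definition above amounts to counting coding bases modulo 3. A small sketch, assuming each exon is given as the length of its coding portion starting at the translation start; the function name `intron_phases` is hypothetical:

```python
def intron_phases(coding_exon_lengths):
    """Phase of the intron that follows each exon except the last one.

    Phase 0: the intron falls between codons.
    Phase 1: after the first base of a codon.
    Phase 2: after the second base of a codon.
    """
    phases = []
    coding_bases = 0
    for length in coding_exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases

print(intron_phases([100, 50, 33]))
```

Here the first intron follows 100 coding bases (phase 1) and the second follows 150 (phase 0).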

GenScan features
Models both strands at once
Each state may output a string of symbols (according to some probability distribution)
Explicit intron/exon length modeling
Advanced splice site modeling
Complete intron/exon annotation for the sequence
Able to predict multiple genes and partial or whole genes
Parameters learned from annotated genes
Separate parameter training for different C+G content groups (< 43%, 43-51%, 51-57%, > 57% C+G content)
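Choosing a parameter set by base composition can be sketched as below. The group boundaries come from the slide; which group a sequence sitting exactly on a boundary falls into is an assumption, and the function names are hypothetical.

```python
def cg_fraction(seq):
    """Fraction of C and G among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("C") + seq.count("G")) / acgt

def cg_group(seq):
    """Label of the composition group whose parameters would be used."""
    pct = 100 * cg_fraction(seq)
    if pct < 43:
        return "<43%"
    if pct < 51:   # assumption: lower bound inclusive, upper bound exclusive
        return "43-51%"
    if pct < 57:
        return "51-57%"
    return ">57%"

print(cg_group("ATATATATGC"), cg_group("ACGT" * 5), cg_group("GCGCGCGCAT"))
```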

Various parameters in GENSCAN

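GenScan Signal Modeling
PSSM (positions independent): P(S) = P_1(S_1) · P_2(S_2) · ... · P_n(S_n)
Used for: polyA signal, translation initiation/termination signals, promoters
WAM (each position conditioned on the previous one): P(S) = P_1(S_1) · P_2(S_2 | S_1) · ... · P_n(S_n | S_{n-1})
Used for: 5' and 3' splice sites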
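The difference between the two formulas shows up when neighbouring positions are correlated. The sketch below estimates both models from a handful of aligned sites and scores two candidates; the training sites, the add-one pseudocount, and all function names are assumptions for illustration, not GENSCAN's actual parameters.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_pssm(sites, pseudo=1.0):
    """One independent base distribution per position."""
    model = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        total = len(sites) + pseudo * len(ALPHABET)
        model.append({b: (counts[b] + pseudo) / total for b in ALPHABET})
    return model

def log_prob_pssm(model, s):
    return sum(math.log(model[i][b]) for i, b in enumerate(s))

def train_wam(sites, pseudo=1.0):
    """First position as in a PSSM; later positions conditioned on the previous base."""
    first = train_pssm([s[0] for s in sites], pseudo)[0]
    cond = []
    for i in range(1, len(sites[0])):
        table = {}
        for prev in ALPHABET:
            counts = Counter(s[i] for s in sites if s[i - 1] == prev)
            total = sum(counts.values()) + pseudo * len(ALPHABET)
            table[prev] = {b: (counts[b] + pseudo) / total for b in ALPHABET}
        cond.append(table)
    return first, cond

def log_prob_wam(wam, s):
    first, cond = wam
    lp = math.log(first[s[0]])
    for i in range(1, len(s)):
        lp += math.log(cond[i - 1][s[i - 1]][s[i]])
    return lp

# Hypothetical sites in which position 2 always repeats position 1.
sites = ["AAG", "AAG", "CCG", "CCG"]
pssm, wam = train_pssm(sites), train_wam(sites)
for cand in ("AAG", "ACG"):
    print(cand, round(log_prob_pssm(pssm, cand), 3), round(log_prob_wam(wam, cand), 3))
```

The PSSM cannot tell AAG from ACG because each position is scored on its own, while the WAM prefers AAG because it has seen A followed by A but never A followed by C.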

GENSCAN Performance
More than 80% of exons are predicted correctly, and more than 90% of bases are correctly classified as coding or noncoding.
BUT the ability to predict a whole gene correctly is much lower.
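Why whole-gene accuracy lags behind the other two figures is easy to see on a toy case. The sketch below computes one plausible version of each measure; the exact definitions used in the GENSCAN evaluation are not given on the slide, so the definitions, the coordinates, and the function names here are assumptions.

```python
def coding_mask(exons, seq_len):
    """Boolean list: True where a base lies inside an exon (half-open intervals)."""
    mask = [False] * seq_len
    for start, end in exons:
        for i in range(start, end):
            mask[i] = True
    return mask

def base_level_accuracy(true_exons, pred_exons, seq_len):
    """Fraction of bases whose coding/noncoding label is predicted correctly."""
    t = coding_mask(true_exons, seq_len)
    p = coding_mask(pred_exons, seq_len)
    return sum(a == b for a, b in zip(t, p)) / seq_len

def exact_exon_fraction(true_exons, pred_exons):
    """Fraction of true exons predicted with both boundaries exactly right."""
    predicted = set(pred_exons)
    return sum(e in predicted for e in true_exons) / len(true_exons)

def whole_gene_correct(true_exons, pred_exons):
    """True only if every exon of the gene is predicted exactly."""
    return sorted(true_exons) == sorted(pred_exons)

true_exons = [(10, 20), (40, 60), (80, 90)]
pred_exons = [(10, 20), (40, 60), (82, 90)]   # last exon starts 2 bases late
print(base_level_accuracy(true_exons, pred_exons, 100))
print(exact_exon_fraction(true_exons, pred_exons))
print(whole_gene_correct(true_exons, pred_exons))
```

A single boundary that is off by two bases barely moves the per-base score, costs one exon out of three, and makes the whole gene count as wrong.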

HMM-based Gene Finding
GENSCAN (Burge 1997)
FGENESH (Solovyev 1997)
HMMgene (Krogh 1997)
GENIE (Kulp 1996)
GENMARK (Borodovsky & McIninch 1993)
VEIL (Henderson, Salzberg, & Fasman 1997)

Using Sequence Similarity for Gene Finding
Compare genomic sequence with expressed sequence tags (ESTs) (e.g. by BLASTN) to identify regions corresponding to processed mRNA.
Compare genomic sequence to a protein DB (e.g. by BLASTX) to identify probable coding regions.
“Spliced alignment” of the genomic sequence of a complete gene with a homologous protein sequence (e.g. by PROCRUSTES) may enable exon/intron reconstruction.
Compare predicted peptides (e.g. from GENSCAN) with a protein DB to assign confidence to predictions and functional annotations.
Compare genomic sequence with homologous sequences from closely related species (e.g. by BLAST, CLUSTALW) to identify conserved regions that might correspond to coding regions and DNA signals.
“Each of these methods can provide useful information about gene locations, as well as clues to gene function, although similarity based methods are currently (1998) able to identify only about half of all human genes, and this proportion is increasing rather slowly. It should be kept in mind that similarity based methods are only as reliable as the DB that are searched, and apparent homology can be misleading at times…” (from Burge review, 1998)

GenomeScan
Idea: we can enhance gene prediction by using external information: DNA regions with homology to known proteins are more likely to be coding exons.
Combine probabilistic ‘extrinsic’ information (BLAST hits) with a probabilistic model of gene structure/composition (GenScan).
Focus on the ‘typical case’ when homologous but not identical proteins are available.

GeneWise [Birney, Amitai]
Motivation: use a good DB of the protein world (PFAM) to help annotate genomic DNA.
The GeneWise algorithm aligns a profile HMM directly to the DNA.

Sample GeneWise Output

Developing GeneWise Model
Start with a PFAM domain HMM
Replace amino-acid emissions with codon emissions
Allow for sequencing errors (deletions/insertions)
Add a 3-state intron model
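The step of replacing amino-acid emissions with codon emissions can be illustrated for a single match state. The sketch below spreads each amino acid's probability over its synonymous codons, uniformly or by a supplied codon usage table; this is one simple way to do it and is an assumption, not a description of how GeneWise is actually parameterized. The function name `codon_emissions` is hypothetical; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_emissions(aa_emissions, codon_usage=None):
    """Turn {amino acid: prob} into {codon: prob} for one profile-HMM state.

    codon_usage, if given, maps codon -> relative frequency and is used to
    weight synonymous codons; otherwise they share the probability equally.
    """
    synonymous = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":                      # stop codons get no emission
            synonymous.setdefault(aa, []).append(codon)
    emissions = {}
    for aa, p in aa_emissions.items():
        codons = synonymous[aa]
        if codon_usage is None:
            weights = {c: 1 / len(codons) for c in codons}
        else:
            total = sum(codon_usage[c] for c in codons)
            weights = {c: codon_usage[c] / total for c in codons}
        for c in codons:
            emissions[c] = p * weights[c]
    return emissions

# Hypothetical match state that emits Met or Leu with equal probability.
state = codon_emissions({"M": 0.5, "L": 0.5})
print(state["ATG"], state["CTG"])
```

Methionine has a single codon, so ATG keeps the full 0.5, while each of the six leucine codons receives 0.5/6 under the uniform assumption.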

GeneWise Model

GeneWise Intron Model
[Figure labels: PY tract, central spacer, 5' site, 3' site]

GeneWise Model
Viterbi algorithm -> “best” alignment of DNA to protein domain
Alignment gives exact exon-intron boundaries
Parameters learned from species-specific statistics

GeneWise problems
Only provides a partial prediction, and only where the homology lies
Does not find “more” genes
Pseudogenes and retrotransposons are picked up
CPU intensive; solution: pre-filter with BLAST
Retrotransposons are explained on pp. 484-5 of Genetic Analysis, 5th edition. Basically, they are parts of the genome that are copied multiple times into the genomic DNA by means of reverse transcription (from mRNA back into DNA). A good example is the ~200 bp human Alu sequence, of which we have hundreds of thousands of copies, making up ~5% of our genome. Alu is an example of a SINE (short interspersed element). There are also LINEs (long interspersed elements), 1-5 kb long, with 20k-40k copies in the human genome. Elements of this class have ORFs that potentially code for enzymes used in transposition. Many are the result of RNA viruses that replicate through a DNA stage which can integrate into host chromosomes.
Pseudogenes are explained on p. 480 of the same book. Basically, a pseudogene was once a copy of a gene that stopped being functional (transcribed) and is therefore now under no selective pressure; it is still very similar to “real” genes, but carries many more mutations.

Other Sequence Usage
Search translated genomic sequences for occurrences of the short peptide motifs that are characteristic of common protein families (e.g. zinc finger, ATP/GTP binding motifs, etc.).
Identify sequences that are probably noncoding: identify known classes of interspersed repeats (e.g. LINEs, SINEs) in noncoding regions. It can be essential to remove these before a simple BLAST search is run against ESTs.
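Searching translated genomic sequence for a peptide motif can be sketched as a six-frame translation followed by a regular-expression scan. The example uses the ATP/GTP-binding P-loop pattern [AG]-x(4)-G-K-[ST] written as a regex; the DNA fragment and the function names are made up for illustration, and real motif scans would use curated pattern databases and handle ambiguous bases more carefully.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons; codons with unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Map frame label (+1..+3, -1..-3) to the translated peptide string."""
    dna = dna.upper()
    rev = dna.translate(COMPLEMENT)[::-1]   # reverse complement
    frames = {}
    for k in range(3):
        frames[f"+{k + 1}"] = translate(dna[k:])
        frames[f"-{k + 1}"] = translate(rev[k:])
    return frames

def find_motif(dna, pattern):
    """Return (frame, peptide offset, matched peptide) for every motif hit."""
    regex = re.compile(pattern)
    hits = []
    for frame, peptide in six_frame_translations(dna).items():
        for m in regex.finditer(peptide):
            hits.append((frame, m.start(), m.group()))
    return hits

# P-loop pattern; [^*] keeps a match from running across a stop codon.
P_LOOP = r"[AG][^*]{4}GK[ST]"

# Hypothetical fragment encoding the peptide MGPPGSGKSTL in frame +1.
dna = "ATGGGTCCTCCAGGTTCTGGTAAATCTACTCTG"
print(find_motif(dna, P_LOOP))
```

Because all six frames are scanned, the same motif is found in a minus frame when the fragment is supplied as its reverse complement.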

Summary
Genes are complex structures that are difficult to predict with the required level of accuracy/confidence.
Different approaches to gene finding:
Ab initio: GenScan
Ab initio modified by BLAST homologies: GenomeScan
Homology guided: GeneWise

Future Directions
Find genes that do not code for proteins (tRNA, rRNA, smRNA): hard!
Deal better with overlapping genes and multiple genes in a single sequence
Alternative splicing/transcription/translation: a whole separate issue, covering the mechanisms governing it and the signals predicting the various genes. Very important!