Biological Motivation Gene Finding in Eukaryotic Genomes

Biological Motivation: Gene Finding in Eukaryotic Genomes. Anne R. Haake, Rhys Price Jones

Recall from our previous discussion of gene finding in prokaryotes: the major strategies in gene finding programs are to look for signals/features, content/composition, and similarity to known genes (BLAST!).

3 Major Categories of Information Used in Gene Finding Programs. Signals/features: a sequence pattern with functional significance, e.g. splice donor and acceptor sites, start and stop codons, and promoter features such as TATA boxes and TF binding sites. Content/composition: statistical properties of coding vs. non-coding regions, e.g. codon bias, length of ORFs in prokaryotes, CpG islands, GC content. Similarity: compare the DNA sequence to known sequences in a database; this involves translation in all 6 possible reading frames. Repetitive sequences, and how to handle them, also need to be considered. Database information can aid the prediction as well: not only known proteins but also ESTs (expressed sequence tags) and cDNAs (DNA copies of mRNAs).
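Similarity searches against protein databases rely on translating the DNA in all six reading frames. The following is a minimal Python sketch of that step, assuming the standard genetic code and an uppercase or lowercase A/C/G/T input; the function names are illustrative and do not come from any particular gene finding program.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

Each of the six protein strings can then be compared against a protein database; stop codons appear as `*`.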

In Prokaryotic Genomes: We usually start by looking for an ORF (a start codon, followed by usually at least 60 amino acid codons before a stop codon occurs) or by searching for similarity to a known ORF. Look for basal signals: transcription (the promoter consensus and the termination consensus) and translation (ribosome binding site: the Shine-Dalgarno sequence). Look for differences in sequence content between coding and non-coding DNA: GC content and codon bias. In yeast, ~1% of genes have ORFs < 100 aa.
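The ORF search described for prokaryotes can be sketched as follows. This is a simplified illustration, not the method of any specific program: it scans only the forward strand, treats ATG as the only start codon, counts the start codon toward the minimum length, and uses a hypothetical `min_codons` parameter defaulting to the 60 codons mentioned above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=60):
    """Find forward-strand ORFs.

    Returns a list of (start, end, frame) tuples, where start is the 0-based
    index of the ATG, end is the index just past the stop codon, and frame
    is 0, 1 or 2.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    demo = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
    print(find_orfs(demo, min_codons=3))
```

To cover the reverse strand, the same function can be run on the reverse complement of the sequence.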

The Complicating Factors in Eukaryotes: Interrupted (split) genes: introns and exons. Large genomes. Most DNA is non-coding: introns, regulatory regions, "junk" DNA (unknown function); only about 3% is coding. Complex regulation of gene expression. Regulatory sequences may be far away from the start codon.

Some numbers to consider: Vertebrate genes average about 30 kb in length, but this varies a lot. The coding region is only about 1-2 kb. Exon sizes and numbers vary a lot; the average is 6 exons, each about 150 bp long. An average 5' UTR is about 750 bp and an average 3' UTR is about 450 bp (both can be much longer). There are huge deviations from all of these numbers, e.g. dystrophin is 2.4 Mb long; the factor VIII gene has 26 exons and introns of up to 32 kb (one intron produces 2 transcripts unrelated to the gene!). There are also genes without introns, called single-exon or intronless genes.

Eukaryotic Gene Structure www.bio.purdue.edu/courses/biol516/eukgenestructure.gif

Given a long eukaryotic DNA sequence: How would you determine if it had a gene? How would you determine which substrings of the sequence contained protein-coding regions?

In prokaryotic genomes we usually start by looking for ORFs. Is this a good approach for the eukaryotic genome? Why or why not?

So, what's the problem with looking for ORFs? "Split" genes make it difficult to define ORFs. Where are the starts and stops? What problems do introns introduce? What would you predict for the size of ORFs? (You can't with any certainty!)

Most Programs Concentrate on Finding Exons. Exon: the region of DNA within a gene that codes for a polypeptide chain or domain. Intron: non-coding sequence found within structural genes. Typically a mature protein is composed of several domains coded by different exons within a single gene.

Splice Sites Used to Define Exons. The splice donor (exon-intron boundary) and splice acceptor (intron-exon boundary) are consensus sequences, i.e. a statistical determination of the pattern that approximates the pattern: (C or A)AG/GT(A or G)AGT is the "donor" splice site and (T or C)nN(C or T)AG/G is the "acceptor" splice site. These sequences in eukaryotic organisms, ranging from yeasts to mammals, have common sequence motifs, with introns beginning with 5'-GU and ending with 3'-AG. The consensus 5' splice site of vertebrate introns is AGGUAAGU, while the consensus at the 3' splice site is a stretch of pyrimidines (U or C), followed by any base, then by C, and ending with the invariant AG. To read more about splicing check out: http://www.biochem.arizona.edu/classes/bioc461/Biochem499/ScottHolata/splicing.htm#consensus or http://opbs.okstate.edu/~melcher/MG/MGW2/MG2315.html

Gene finding programs look for different types of exon. Single-exon genes: begin with a start codon and end with a stop codon. Initial exons: begin with a start codon and end with a donor site. Internal exons: begin with an acceptor and end with a donor. Terminal exons: begin with an acceptor and end with a stop codon.

How are correct splice sites identified? There are many occurrences of GT or AG within introns that are not splice sites, so statistical profiles of splice sites are used. http://www.lclark.edu/~lycan/Bio490/pptpresentations/mutation/sld016.htm
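One common form of statistical profile is a log-odds position weight matrix. The sketch below builds one from a handful of aligned donor sites and scores candidate windows. The six training 9-mers (3 exon bases followed by 6 intron bases) are made up for illustration, and the uniform background and pseudocount are assumptions; a real profile would be estimated from many annotated splice sites.

```python
import math

# Hypothetical aligned donor sites: 3 exon bases + 6 intron bases.
EXAMPLE_DONORS = [
    "CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA",
    "AAGGTAAGT", "CAGGTGAGT", "TAGGTAAGG",
]


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return a list of per-position dicts of log2-odds scores."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return pwm


def score(pwm, window):
    """Sum the log-odds scores of a window the same length as the matrix."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, seq, threshold=0.0):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        s = score(pwm, seq[i:i + width])
        if s >= threshold:
            hits.append((i, s))
    return hits


if __name__ == "__main__":
    pwm = build_pwm(EXAMPLE_DONORS)
    print(scan(pwm, "TTCAGGTAAGTCCGTCCCC", threshold=5.0))
```

A GT dinucleotide in a poor context (for example `GTCCCC`) scores much lower than one embedded in the consensus, which is how such profiles separate likely splice sites from chance occurrences of GT.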

Other Biologically Important Signals Used in Gene Finding Programs. Transcriptional signals: the transcription start is characterized by a cap signal, a single purine (A/G); the TATA box (promoter) lies at -25 relative to the start; the polyadenylation signal is AATAAA (3' end). Major caveat: not all genes have these signals, which makes it difficult to define the beginning and end of a gene.

Upstream Promoter Sites: Transcription factor (TF) sites. Transcription factors are sequence-specific DNA-binding proteins that bind to consensus DNA sequences, e.g. the CAAT transcription factor and the CAAT box. There are many of these, and they vary in sequence, location, and interaction with other sites. This further complicates the problem of delineating a "gene".

Translation Signals. Kozak sequence: the signal for initiation of translation in vertebrates; the consensus is GCCACCatgG. And of course, translation stop codons.

Codon Bias in Eukaryotic Genomes. Yeast genome: Arg is specified by AGA 48% of the time (the other five equivalent codons ~10% each). Fruit fly genome: Arg is specified by CGC 33% of the time (the other five ~13% each).
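Figures like these come from counting how often each synonymous codon is used in known coding sequences. A minimal sketch, assuming an in-frame coding sequence as input; the function name and the short demo sequence are illustrative only:

```python
from collections import Counter

# The six codons that specify arginine in the standard genetic code.
ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")


def codon_fractions(cds, codons=ARG_CODONS):
    """Return the fraction of each synonymous codon among the given set.

    cds must be an in-frame coding sequence; trailing partial codons
    are ignored.
    """
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}


if __name__ == "__main__":
    # ATG AGA AGA CGC AGA TAA
    print(codon_fractions("ATGAGAAGACGCAGATAA"))
```

Run over all annotated coding sequences of a genome, the same counts give the organism-specific codon usage table that content-based gene finders compare candidate regions against.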

GC Content in Eukaryotes: Overall GC content does not vary between species the way it does in prokaryotes, but GC content is still important in gene finding algorithms. CpG Islands: CG dinucleotides occur at low frequency overall in the genome. Exception: in CpG islands near promoters, CG dinucleotides occur at the level predicted by chance, from -1,500 to +500 relative to the transcription start site.

CpG Islands: Their occurrence is related to methylation of C in CG dinucleotides. Methylation of C makes CpG prone to mutation (e.g. to TpG or CpA). The level of methylation is low in actively transcribed genes; transcription requires a methyl-free promoter.
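"At the level predicted by chance" can be made concrete with the observed/expected CpG ratio: the number of CG dinucleotides divided by (number of C × number of G) / length. The sketch below computes it for a window. The default thresholds (at least 200 bp, GC content of at least 0.5, ratio of at least 0.6) are commonly used criteria and are an assumption here, not values taken from the slides.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    observed = seq.count("CG")
    expected = c * g / n  # CpG count expected if C and G were independent
    ratio = observed / expected if expected else 0.0
    return (c + g) / n, ratio


def looks_like_cpg_island(seq, min_length=200, min_gc=0.5, min_ratio=0.6):
    """Apply simple length, GC-content and obs/exp thresholds to one window."""
    gc, ratio = cpg_stats(seq)
    return len(seq) >= min_length and gc >= min_gc and ratio >= min_ratio


if __name__ == "__main__":
    print(cpg_stats("CGCGCGCG"))
    print(looks_like_cpg_island("AT" * 100))
```

In bulk genomic DNA the ratio is well below 1 because methylated CpGs are depleted by mutation; in islands it stays close to 1.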

Gene Finding Strategies. Homology-based approach: find sequences that are similar to known gene sequences. Ab initio approach: identify genes by signal sequences and composition.

List of Gene Finding Programs http://www.hku.hk/bruhk/sggene.html

Homology-Based Approaches in Eukaryotic Genomes: More complicated than in prokaryotes due to split genes. Starting from the genome sequence, first identify all candidate exons, then use a spliced alignment algorithm to explore all possible exon assemblies and compare them to known sequences, e.g. Procrustes. Limitations: there must be a similar sequence in the database with known exon structure, and the approach is sensitive to frame shift errors.

Procrustes: Gene recognition via spliced alignment. Given a genomic sequence and a set of candidate exons, the spliced alignment algorithm explores all possible exon assemblies and finds a chain of exons with the best fit to a related target protein. http://hto-13.usc.edu/software/procrustes/#salign

GenScan: Allows integration of multiple types of information; earlier programs considered features of gene structure in isolation. Uses a generalized HMM (one state might use a weight matrix model, another an HMM). http://genes.mit.edu/GENSCAN.html GenScan is efficient only with human DNA because it uses human genes as the training set for the HMM. The states correspond to the different functional units of a gene. The transitions between the states ensure that the order is biologically consistent, e.g. the promoter state must move to the state corresponding to the 5' UTR.

GenScan Probabilistic Model of Genes: Accounts for many of the known structural and compositional properties of genes, including: typical gene density; typical number of exons per gene; distribution of exon sizes for different types of exon; compositional properties of coding vs. non-coding regions; translation initiation (Kozak) and termination signals; TATA box, cap site and polyadenylation signals; donor and acceptor splice sites.

GenScan uses a training set of 238 multi-exon genes and 142 single-exon genes from GenBank to compute its parameters: initial state probabilities, transition probabilities, and state length distributions.

GenScan: Probabilistic models for the states. The states correspond to different functional units of a gene, e.g. promoter regions and exons. Transitions ensure that the order in which the model marches through the states is biologically consistent. Length distributions take into account that different functional units have different lengths.
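The idea of states, transitions, and decoding the most probable state path can be illustrated with a toy two-state HMM and the Viterbi algorithm. This sketch is deliberately far simpler than GenScan: it is an ordinary HMM rather than a generalized HMM, it has no explicit length distributions, and every probability below is a made-up value chosen only so that GC-rich stretches favour the "coding" state.

```python
import math

STATES = ("coding", "noncoding")
# All probabilities are hypothetical, for illustration only.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state for ending in state s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("GCGCGCGCGCGCATATATATATAT"))
```

A gene finder built on this principle uses many more states (promoter, UTRs, exon types, introns) and restricts the transition table so that only biologically consistent orders have non-zero probability.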

GenScan: Signal models used by GenScan. WMM = weight matrix model, used for transcriptional and translational signals (translation initiation, polyadenylation signals, TATA box, etc.); e.g. the polyadenylation signal is modeled as a 6 bp WMM with AATAAA as the consensus sequence. WAM = weight array model, which assumes some dependencies between adjacent positions in the sequence; e.g. used for the pyrimidine-rich region and the splice acceptor site. Maximal dependency decomposition; e.g. used for donor splice sites.
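The difference between a WMM and a WAM is that the WAM conditions each position on its neighbour. The sketch below shows a first-order weight array model in which the probability of a base at position i depends on the base at position i-1. It is an illustration of the general idea, not GenScan's actual parameterization; the four acceptor-like training sequences and the pseudocount are hypothetical.

```python
import math

BASES = "ACGT"

# Hypothetical aligned acceptor-region sequences (pyrimidine tract + NCAG).
EXAMPLE_ACCEPTORS = ["TTTCTTCCAG", "CTTTCCTTAG", "TCCTTTTCAG", "TTCCTCTCAG"]


def train_wam(sites, pseudocount=1.0):
    """Return (first, cond): P(x_0) and per-position P(x_i | x_{i-1}) tables."""
    first = {b: pseudocount for b in BASES}
    for site in sites:
        first[site[0]] += 1
    total = sum(first.values())
    first = {b: first[b] / total for b in BASES}

    cond = []
    for pos in range(1, len(sites[0])):
        table = {p: {b: pseudocount for b in BASES} for p in BASES}
        for site in sites:
            table[site[pos - 1]][site[pos]] += 1
        for p in BASES:
            row_total = sum(table[p].values())
            table[p] = {b: table[p][b] / row_total for b in BASES}
        cond.append(table)
    return first, cond


def wam_log_prob(model, window):
    """Log probability of a window under the first-order WAM."""
    first, cond = model
    logp = math.log(first[window[0]])
    for pos in range(1, len(window)):
        logp += math.log(cond[pos - 1][window[pos - 1]][window[pos]])
    return logp


if __name__ == "__main__":
    model = train_wam(EXAMPLE_ACCEPTORS)
    print(wam_log_prob(model, "TTTCTTCCAG"), wam_log_prob(model, "GAGAGGAAGA"))
```

A WMM is the special case in which every conditional table has identical rows, so each position is scored independently of its neighbour.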

GenScan does not use similarity search. It uses a double-stranded genomic sequence model: potential genes on both strands are analysed simultaneously. Limitations: it cannot handle overlapping transcription units and does not address alternative splicing.

GRAIL: GRAIL (Gene Recognition and Assembly Internet Link) uses a number of sensor algorithms to evaluate the coding potential of a DNA sequence; features include 6-mer composition, GC composition and splice junction recognition. The output of the sensor algorithms is input to a neural network, which uses empirical data for training. GRAIL provides analysis of the protein coding potential of a DNA sequence. GRAIL uses variable-length windows tailored to each potential exon candidate, defined as an open reading frame bounded by a pair of start/donor, acceptor/donor or acceptor/stop sites. This scheme facilitates the use of more genomic context information (splice junctions, translation starts, non-coding scores of 60-base regions on either side of a putative exon) in the exon recognition process. GRAIL finds about 91% of all coding regions with an apparent false positive rate of 8.6%.

GRAIL-exp http://compbio.ornl.gov/grailexp/gxpfaq1.html