Presentation on theme: "Biological Motivation Gene Finding"— Presentation transcript:
1 Biological Motivation: Gene Finding. Anne R. Haake, Rhys Price Jones
2 Gene Finding. Why do it? Find and annotate all the genes within the large volume of DNA sequence data: how many genes in an organism? Homologies? Gain understanding of problems in basic science, e.g. gene regulation: what are the mechanisms involved in transcription, splicing, etc.? A different emphasis among these goals has some effect on the design of computational approaches for gene finding.
3 Gene Finding by Biological Methods: extract mRNA; reverse transcribe to cDNA; label cDNA; detect by using the cDNA as a probe against a DNA library; gene found.
4 Gene Finding by Computational Methods. Dependent on good experimental data to build reliable predictive models. Various aspects of gene structure/function provide information used in gene finding programs.
5 Figure 12.3. In prokaryotes, these processes (transcription and translation) are coupled. In eukaryotes, these processes are physically separated and there are more steps!
6 The Informatics View of Genes. Genes are character strings embedded in much larger strings called genomes. Genes are composed of ordered elements associated with the fundamental genetic processes, including transcription, splicing, and translation.
7 Gene Finding. Cells recognize genes from DNA sequence; they find genes via their bioprocesses. Not so easy for us.
8 CTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGGTGAGAAGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGAGGAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGTTTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGTGGGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAGAATGAACTGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAACTGAAAAGTTAGAATCTCAAGACTCAAGTACGCTACTATGCACTTGTTTTATTTCATTTTTCTAAGAAACTAAAAATACTTGTTAATAAGTACCTANGTATGGTTTATTGGTTTTCCCCCTTCATGCCTTGGACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTGCATATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAACAAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTATTGTTATGAGACTGGATATAT... Even when an entire genome is sequenced, much work remains to make sense of the sequence. It may appear simple but is not. For example, where do genes begin? Where do they end? How do we find individual changes in sequence that are meaningful? Some aren’t, because not all sequences contain coding information for amino acids and proteins; even in coding sequences some differences are tolerated. It is like finding a needle in a haystack. The example here is from the gene that, when mutated, is responsible for cystic fibrosis.
10 Types of Genes. Protein coding: most genes. RNA genes: rRNA, tRNA, snRNA (small nuclear RNA), snoRNA (small nucleolar RNA). snRNAs are usually associated with proteins and are then known as snRNPs or "snurps", e.g. those involved in splicing. snoRNAs are involved in processing of rRNAs, perhaps in ribosome assembly. Pol I: most rRNAs; snoRNAs. Pol II: mRNAs, snRNAs. Pol III: tRNAs, 5S rRNA.
11 Three Major Categories of Information Used in Gene Finding Programs. Signals/features: a sequence pattern with functional significance, e.g. splice donor and acceptor sites, start and stop codons, promoter features such as TATA boxes, TF binding sites, CpG islands. Content/composition: statistical properties of coding vs. non-coding regions, e.g. codon bias, length of ORFs in prokaryotes, GC content. Similarity: compare the DNA sequence to known sequences in a database, not only known proteins but also ESTs and cDNAs; this involves translation in all 6 possible reading frames. Need to discuss repetitive sequences and how to handle them, and to explain ESTs and cDNAs.
12 Looking for Protein Coding Genes. Look for an ORF (begins with a start codon, ends with a stop codon, no internal stops!): long (usually > aa); more likely if homologous to a “known” protein. Look for basal signals: transcription, splicing, translation. Look for regulatory signals. Depends on organism: prokaryotes vs. eukaryotes; vertebrates vs. fungi. In yeast, ~1% of genes have ORFs < 100 aa.
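The ORF criteria on this slide (start codon, stop codon, no internal stops, minimum length) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular gene finder: the function name `find_orfs`, the default start codon set, and the `min_aa` default of 100 are assumptions made here (the slide's own length threshold is missing), and only the forward strand is scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, start_codons=("ATG",), min_aa=100):
    """Return (start, end, frame) for forward-strand ORF candidates.

    start and end are 0-based, end-exclusive, and include the stop codon.
    min_aa is the minimum number of codons from the start codon up to,
    but not including, the stop codon (an assumed, tunable threshold).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # first start codon seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in start_codons:
                start = i
            elif codon in STOP_CODONS:
                if start is not None and (i - start) // 3 >= min_aa:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Toy example: a 6-codon ORF (ATG + 5 x GCT) followed by TAA, in frame 2.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_aa=5))
```

To cover all six reading frames, the same function could be run on the reverse complement of the sequence as well.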
13 Easier problem: Gene Finding in Bacterial Genomes. Why? Dense genomes; short intergenic regions; uninterrupted ORFs; conserved signals; abundant comparative information; complete genomes available for many.
14 What do Prokaryotic Genes look like? From 5’ to 3’: promoter region (maybe); ribosome binding site (maybe); start codon; open reading frame; stop codon; termination sequence (maybe).
16 Open Reading Frame (ORF). Any stretch of DNA that potentially encodes a protein. The identification of an ORF is the first indication that a segment of DNA may be part of a functional gene.
17 Open Reading Frames. Sequence: A C G T A A C T G A C T A G G T G A A T. Each grouping of the nucleotides into consecutive triplets constitutes a reading frame. There are three different reading frames in the 5’->3’ direction and a further three in the reverse direction on the opposite strand. A sequence of triplets that contains no stop codon is an Open Reading Frame (ORF). Frame 1: ACG TAA CTG ACT AGG TGA; frame 2: CGT AAC TGA CTA GGT GAA; frame 3: GTA ACT GAC TAG GTG AAT.
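As a small illustration of the six reading frames, the Python sketch below splits the slide's 20-nucleotide example into triplets on both strands and reports which frames contain a stop codon. The helper names (`reverse_complement`, `reading_frames`) and the frame labels "+1" to "-3" are choices made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def reading_frames(seq):
    """Map a frame label ('+1'..'+3', '-1'..'-3') to its complete triplets."""
    frames = {}
    for label, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames[f"{label}{offset + 1}"] = codons
    return frames

seq = "ACGTAACTGACTAGGTGAAT"  # the example sequence from the slide
frames = reading_frames(seq)
for label, codons in frames.items():
    has_stop = any(c in STOP_CODONS for c in codons)
    status = "contains a stop codon" if has_stop else "open"
    print(label, " ".join(codons), "->", status)
```

Frames "+2" and "+3" reproduce the two triplet groupings shown on the slide.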
18 ORFs as gene candidates. An open reading frame that begins with a start codon (usually ATG, GTG or TTG, but this is species-dependent) is a gene candidate. Most prokaryotic genes code for proteins that are 60 or more amino acids in length. The probability that a random sequence of n codons has no stop codons is (61/64)^n, since 61 of the 64 codons are not stops. When n is 50, there is a probability of about 91% that the random sequence contains a stop codon. When n is 100, this probability exceeds 99%.
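The stop-codon probabilities on this slide can be checked numerically. The short Python sketch below assumes, as the slide's formula does, that all 64 codons are equally likely and independent; the function name `prob_no_stop` is chosen here for illustration.

```python
def prob_no_stop(n_codons):
    """Probability that n random codons contain no stop codon,
    assuming all 64 codons are equally likely and independent."""
    return (61 / 64) ** n_codons

for n in (50, 100):
    p_stop = 1 - prob_no_stop(n)
    print(f"n = {n}: P(at least one stop codon) = {p_stop:.3f}")
# n = 50 gives about 0.909 and n = 100 gives about 0.992
```

Real genomes have skewed base composition, so these figures are only a rough guide to how surprising a long stop-free stretch is.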
19 Codon Bias. The genetic code is degenerate: equivalent triplet codons code for the same amino acid. Codon usage varies: organism to organism; gene to gene. Biological basis: avoidance of codons similar to stop; preference for codons that correspond to abundant tRNAs within the organism.
21 Codon Bias: organism differences. Yeast genome: Arg specified by AGA 48% of the time (other five equivalent codons ~10% each). Fruit fly genome: Arg specified by CGC 33% of the time (other five ~13% each). Complete set of codon usage biases can be found at:
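Percentages like those quoted for arginine can be computed by counting in-frame codons in known coding sequences. The Python sketch below shows the idea on a made-up toy sequence; the function name `synonymous_usage` and the toy input are illustrative assumptions, not data from the slides.

```python
from collections import Counter

# The six codons that specify arginine in the standard genetic code.
ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")

def synonymous_usage(coding_seq, codons=ARG_CODONS):
    """Fraction of each synonymous codon among the in-frame codons of
    coding_seq (assumed to start in frame 1)."""
    seq = coding_seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}

# Toy coding sequence: ATG AGA AGA CGC AGA TAA
usage = synonymous_usage("ATGAGAAGACGCAGATAA")
print(usage)  # AGA accounts for 3 of the 4 arginine codons here
```

In practice the counts would be pooled over many genes of one organism before comparing organisms or genes.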
22 GC Content. GC relative to AT is a distinguishing factor of bacterial genomes. Varies dramatically across species. Serves as a means to identify bacterial species. Varies for various biological reasons: mutational bias of particular DNA polymerases; DNA repair mechanisms; horizontal gene transfer (transformation, transduction, conjugation).
23 GC Content. GC content may be different in recently acquired genes than elsewhere. This can lead to variations in the frequency of codon usage within coding regions. There may be significant differences in codon bias within different genes of a single bacterium’s genome.
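One simple way to look for regions whose GC content differs from the rest of a genome is a sliding-window scan. The Python sketch below is a minimal illustration under assumed parameters: the window size, step, and threshold are arbitrary defaults chosen here, and the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def atypical_gc_windows(seq, window=1000, step=500, threshold=0.10):
    """Return (start, gc) for windows whose GC fraction differs from the
    whole-sequence GC fraction by more than threshold."""
    overall = gc_content(seq)
    flagged = []
    for start in range(0, len(seq) - window + 1, step):
        gc = gc_content(seq[start:start + window])
        if abs(gc - overall) > threshold:
            flagged.append((start, gc))
    return flagged

# Toy example: an AT-rich background with a short GC-rich insert.
toy = "AT" * 50 + "GC" * 10 + "AT" * 50
print(atypical_gc_windows(toy, window=20, step=20, threshold=0.5))
```

A flagged window is only a hint of atypical composition; it does not by itself show that a region was recently acquired.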
24 Ribosome Binding Sites. The RBS is also known as a Shine-Dalgarno sequence (species-dependent); it should bind well with the 3’ end of 16S rRNA (part of the ribosome). Usually found within 4-18 nucleotides of the start codon of a true gene.
25 Shine-Dalgarno Sequence. A nucleotide sequence (consensus = AGGAGG) that is present in the 5'-untranslated region of prokaryotic mRNAs. This sequence serves as a binding site for ribosomes and is thought to influence the reading frame. If a subsequence aligning well with the Shine-Dalgarno sequence is found within 4-18 nucleotides of an ORF’s start codon, that improves the ORF’s candidacy.
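The check described here can be sketched as a search for the best ungapped match to AGGAGG upstream of a candidate start codon. In this Python sketch the "4-18 nucleotides" is interpreted as the gap between the end of the motif and the start codon, which is an assumption made for illustration; the function name `best_sd_match` and the simple match-count score are also illustrative choices.

```python
SD_CONSENSUS = "AGGAGG"

def best_sd_match(seq, start, min_gap=4, max_gap=18, motif=SD_CONSENSUS):
    """Return (matches, gap, site) for the best upstream match, or None.

    start is the 0-based index of the start codon; gap is the number of
    nucleotides between the end of the site and the start codon.
    """
    seq = seq.upper()
    k = len(motif)
    best = None
    for gap in range(min_gap, max_gap + 1):
        p = start - gap - k          # where the candidate site begins
        if p < 0:
            break                    # ran off the 5' end of the sequence
        site = seq[p:p + k]
        matches = sum(a == b for a, b in zip(site, motif))
        if best is None or matches > best[0]:
            best = (matches, gap, site)
    return best

# Toy example: a perfect site 7 nt upstream of the ATG at index 17.
toy = "TTTT" + "AGGAGG" + "TTTTTTT" + "ATG" + "AAA"
print(best_sd_match(toy, start=17))
```

An ORF whose best score is, say, 5 or 6 matches out of 6 would rank higher as a candidate than one with no good upstream match.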
26 Bacterial Promoter. -35 region: T82 T84 G78 A65 C54 A45 … (16-18 bp) … -10 region: T80 A95 T45 A60 A50 T96 … (A,G). The number after each base is its percent occurrence at that position. Not so simple: remember, these are consensus sequences.
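Because these are consensus sequences, a promoter search has to score partial matches rather than look for exact strings. The Python sketch below uses a deliberately crude score, the sum of the slide's percentages at positions that match the consensus, and tries spacings of 16 to 18 bp. The scoring scheme and the function names are assumptions made for illustration; real tools typically use log-odds weight matrices.

```python
# (consensus base, percent occurrence) at each position, from the slide.
MINUS_35 = [("T", 82), ("T", 84), ("G", 78), ("A", 65), ("C", 54), ("A", 45)]
MINUS_10 = [("T", 80), ("A", 95), ("T", 45), ("A", 60), ("A", 50), ("T", 96)]

def hexamer_score(site, consensus):
    """Sum the consensus percentages at positions where site matches."""
    return sum(pct for base, (cons, pct) in zip(site, consensus)
               if base == cons)

def best_promoter(seq, spacings=(16, 17, 18)):
    """Return (score, pos35, spacing) for the best -35/-10 pair, or None."""
    seq = seq.upper()
    best = None
    for i in range(len(seq) - 6 + 1):
        s35 = hexamer_score(seq[i:i + 6], MINUS_35)
        for gap in spacings:
            j = i + 6 + gap          # start of the candidate -10 hexamer
            if j + 6 > len(seq):
                continue
            total = s35 + hexamer_score(seq[j:j + 6], MINUS_10)
            if best is None or total > best[0]:
                best = (total, i, gap)
    return best

# Toy example: perfect TTGACA and TATAAT hexamers separated by 17 bp.
toy = "CCC" + "TTGACA" + "C" * 17 + "TATAAT" + "CCC"
print(best_promoter(toy))  # (834, 3, 17)
```

The maximum possible score under this scheme is 408 + 426 = 834, reached only when both hexamers match the consensus exactly.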
27 Termination Sequences. 3’-U tail; stem/loop; inverted repeat immediately preceding the run of uracils.
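The pattern on this slide, an inverted repeat followed directly by a run of U (T in the DNA sequence), can be searched for with a simple scan. The Python sketch below only finds perfect inverted repeats; the stem length, loop range, and minimum T-run length are assumed values chosen for illustration, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def find_terminators(seq, stem=6, loop_range=(3, 8), min_t_run=4):
    """Return (stem_start, loop_len) for perfect inverted repeats that are
    immediately followed by at least min_t_run T's."""
    seq = seq.upper()
    hits = []
    last_start = len(seq) - 2 * stem - loop_range[0] - min_t_run
    for i in range(last_start + 1):
        right_arm = reverse_complement(seq[i:i + stem])
        for loop in range(loop_range[0], loop_range[1] + 1):
            j = i + stem + loop      # start of the right arm of the stem
            tail = j + stem          # first base after the stem-loop
            if tail + min_t_run > len(seq):
                break
            if (seq[j:j + stem] == right_arm
                    and seq[tail:tail + min_t_run] == "T" * min_t_run):
                hits.append((i, loop))
    return hits

# Toy example: GCCGCC / GGCGGC stem, 4-nt loop, then a run of T's.
toy = "CC" + "GCCGCC" + "TTAA" + "GGCGGC" + "TTTTTT" + "AA"
print(find_terminators(toy))  # [(2, 4)]
```

Real terminator stems often contain mismatches, so a practical search would allow imperfect pairing or use a folding-energy estimate instead.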