Finding Eukaryotic Open reading frames.

1 Finding Eukaryotic Open reading frames.

2 Introduction
The open reading frame (ORF) in prokaryotic DNA (test your application). The structure of the eukaryotic gene. Finding genes in eukaryotes: ORFs and problems with ORFs; first exon and first intron; distinguishing introns/exons (splice sites); proximity of promoters (mentioned); base pair patterns; homology with existing sequences.

3 ORF in prokaryotes (pal gene, E. coli)
Adapted from Understanding Bioinformatics 9.3

4 Using your assignment code
Open the file: ORF pal gene.fasta. Find all open reading frames. (This time you must modify your code to translate each codon; copy from convertor_hashtable.txt.) Compare the result to the file: pal protein sequence.fasta. Visually inspect the files. What conclusion can you draw? On which reading frame is the true ORF?
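As a rough illustration of the exercise above, the following Python sketch scans the three forward reading frames for ATG…stop ORFs and translates each codon. It is not the assignment solution: the function names (read_fasta, find_orfs) are hypothetical, the codon table is the standard genetic code rather than the contents of convertor_hashtable.txt, and the reverse strand is not searched.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def read_fasta(path):
    """Return the sequence of a single-record FASTA file as one uppercase string."""
    with open(path) as handle:
        return "".join(line.strip() for line in handle if not line.startswith(">")).upper()


def find_orfs(dna):
    """Yield (frame, start, end, protein) for ORFs in the three forward frames.

    Each ORF runs from the first ATG after the previous stop codon up to the
    next in-frame stop codon; 'end' is the index just past the stop codon.
    """
    for frame in range(3):
        start = None
        protein = []
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start, protein = i, ["M"]
            else:
                aa = CODON_TABLE.get(codon, "X")  # X marks an unrecognised codon
                if aa == "*":
                    yield frame, start, i + 3, "".join(protein)
                    start = None
                else:
                    protein.append(aa)


# Toy sequence (an assumption for illustration, not the pal gene).
for orf in find_orfs("CCATGAAATTTTAGGG"):
    print(orf)
```

To try it on the course files, the sequence returned by read_fasta("ORF pal gene.fasta") could be passed to find_orfs and the translated ORFs compared by eye with pal protein sequence.fasta.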

5 Structure of a eukaryotic “gene”
The “basic” transcription/translation of a eukaryotic gene. An ORF in a eukaryote is a region of the DNA which “could be” a coding sequence (CDS) of a gene. It has a start codon (ATG) and a stop codon, one of three (TAA, TAG, TGA). The diagram shows the DNA sequence of a eukaryotic gene including promoter, UTR…. Eukaryotic expression showing exons/introns…, adapted from Zhang 2002.

6 Structure of Eukaryotic CDS
ALDH10 gene: exon 1 includes a 5’ UTR.

7 ORF in Eukaryotes
In comparison to prokaryotic DNA, eukaryotic DNA differs as follows. Gene density is much lower: genes are further apart, and density can vary significantly between chromosomes (~1.5% of human DNA is CDS). The mRNA is monocistronic (one promoter per gene; N.B. prokaryotic genes are generally organised in operons); moreover, a DNA sequence is transcribed into one mRNA sequence [this may not be true of viral DNA]. The “ORF” in the DNA sequence contains exons (translated sequences, CDS) and introns (non-CDS). The introns are spliced out to leave only exons.

8 ORF in Eukaryotes
Some impacts of the differences: You can no longer reliably translate an ORF into an amino acid (AA) sequence to obtain the “true” protein sequence. The DNA sequence of an intron is like any non-coding region of the DNA, in that the bases are just bases and should not be read as elements of a “codon”. So in an intron sequence ATG does not represent a start codon, and TAA/TAG/TGA do not represent stop codons. This increases the complexity of determining the true ORF in DNA sequences, because of “false positive” start and stop codons in introns; thus many attempts to find genes/ORFs are now done via mRNA (not pre-mRNA). The sizes of introns and exons need not be multiples of three. The impact of this on DNA ORF analysis is to “shift” the DNA reading frames.
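The “false positive” stop codon problem can be seen on a small made-up gene. In the sketch below the exon and intron sequences are invented for illustration (they are not taken from any real gene), and translate is a hypothetical helper using the standard genetic code. Translating the genomic DNA directly runs into a TAG inside the intron, while translating the spliced sequence gives the intended peptide.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate from position 0 in frame, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognised codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Invented toy gene: two exons separated by a 12 bp intron (GT...AG).
EXON_1 = "ATGGCC"
INTRON = "GTAAGTTAGCAG"   # contains an in-frame TAG that is not a real stop codon
EXON_2 = "AAGTTTTAA"

genomic = EXON_1 + INTRON + EXON_2   # what a naive ORF scan of the DNA sees
spliced = EXON_1 + EXON_2            # what the ribosome sees after splicing

print("direct translation of DNA:", translate(genomic))
print("translation after splicing:", translate(spliced))
```

Here the intron length is a multiple of three, so the only damage is the intron being read as codons; an intron whose length is not a multiple of three would additionally shift the frame of everything downstream.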

9 Figure 9.2a the CDK10 gene
Note that in this figure the ATG is shown in a red box (it is 12 nucleotides into the first exon). (This will not impact on the ORF but will mean an incorrect gene annotation: why?) Only the first exon and intron are shown in full; for the rest, partial intron sequences are shown. The full sequences can be found in the PISSRLE DNA sequence.

10 Frame shifts using ORF technique
When a pre-mRNA is spliced into mRNA, all the exons will have to be in one reading frame. However: splice sites need not fall at the start or end of a codon; introns need not be multiples of three in size. What is the net effect of this when trying to “predict” the translation of proteins using DNA sequences? It can affect the translation of an exon. It can affect the starting residue of the following exon. It can mean the “translation” of one or more exons is carried out in the incorrect reading frame. The effect on manual translation depends on the starting position of the exon (correct reading frame), the length of the exon and the length of the intron. Refer to chapter 9 of Understanding Bioinformatics.

11 Predictive translation effect Exons/intron length
Consider the following: we have an mRNA CDS of 60 bp in length (start…stop). Let us assume that intron 1 occurs at the end of codon three (position 9) and is 30 bp in length. Intron 2 occurs at the end of codon 10 (position 30) and is 45 bp in length. What is the effect on the translations of Exon A and Exon B?
Diagram (DNA strand, ATG to TAA): exon 9 bp, intron 30 bp, exon 21 bp, intron 45 bp, exon 30 bp.

12 Predictive translation effect Exons/intron length
Consider the following: we have an mRNA CDS of 60 bp in length (start…stop). Let us assume that intron 1 occurs at the end of codon three (position 9) and is 30 bp in length. Intron 2 occurs after position 29 (just before the 3rd bp of codon 10) and is 45 bp in length. What is the effect on the translations of Exon A and Exon B?
Diagram (DNA strand, ATG to TAA): exon 9 bp, intron 30 bp, exon 20 bp, intron 45 bp, exon 31 bp.

13 Predictive translation effect Exons/intron length
Consider the following: we have an mRNA CDS of 60 bp in length (start…stop). Let us assume that intron 1 occurs at the end of codon three (position 9) and is 30 bp in length. Intron 2 occurs at position 30 (the end of codon 10) and is 43 bp in length. What is the effect on the translations of Exon A and Exon B?
Diagram (DNA strand, ATG to TAA): exon 9 bp, intron 30 bp, exon 21 bp, intron 43 bp, exon 30 bp.

14 Effect of Translation
Example 1: no effect; all lengths are multiples of 3. Example 2: the last residue of exon 2 is incorrect. The residues for exon 3 are correct (but the exon starts partway through its first codon). Example 3: the last exon is in a different reading frame. Refer to Incorrect_translation_examples.rar
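The three examples can be checked with a little arithmetic. The sketch below is an illustrative helper (exon_frames is a hypothetical name, not part of the course files): for each exon it reports where the exon starts in the DNA, which DNA reading frame (relative to the ATG) reads its codons correctly, and whether the exon begins in the middle of a codon. Positions are counted from 0 at the A of the ATG.

```python
def exon_frames(segments):
    """Report (dna_start, dna_frame, starts_mid_codon) for each exon.

    segments: list of ("exon" | "intron", length_in_bp) in DNA order,
    beginning at the A of the start codon.
    dna_frame is the DNA reading frame (0, 1 or 2, relative to the ATG)
    in which the exon's codons line up with the true mRNA codons.
    """
    dna_pos = 0    # position along the genomic DNA
    mrna_pos = 0   # position along the spliced CDS
    report = []
    for kind, length in segments:
        if kind == "exon":
            dna_frame = (dna_pos - mrna_pos) % 3
            starts_mid_codon = mrna_pos % 3 != 0
            report.append((dna_pos, dna_frame, starts_mid_codon))
            mrna_pos += length
        dna_pos += length
    return report


# Segment lengths taken from the three slide examples.
example_1 = [("exon", 9), ("intron", 30), ("exon", 21), ("intron", 45), ("exon", 30)]
example_2 = [("exon", 9), ("intron", 30), ("exon", 20), ("intron", 45), ("exon", 31)]
example_3 = [("exon", 9), ("intron", 30), ("exon", 21), ("intron", 43), ("exon", 30)]

for name, example in [("1", example_1), ("2", example_2), ("3", example_3)]:
    print("Example", name, exon_frames(example))
```

In example 1 every exon stays in frame 0. In example 2 the third exon is still in frame 0 but starts mid-codon, so only the codon spanning the junction is mistranslated. In example 3 the third exon moves to frame 1, because the 43 bp intron is not a multiple of three.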

15 Predicting exons ALDH10 gene
The diagram shows exon 1 and exon 2 of the ALDH10 gene. The correct coding sequence is shown in upper case: the second ATG is the actual start codon. The sequences can be found in the sample sequence files. What is the length of each exon (CDS)? Consider what may happen if you applied a translation to each of the reading frames. Exon 1 is position ; exon 2 is . The position of the actual ATG is 1610. Figure 9.7: Understanding Bioinformatics

16 Finding Exons Coding regions
To avoid the previous issue(s) it is imperative to identify splice sites. Identify start and stop signals (refer to Zhang 2002, Chasin 2007): initial exon (start codon and 5’ splice site); internal exon (3’ and 5’ splice sites); terminal exon (3’ splice site and stop codon). Identify splice junctions: the 5’ splice junction is in general GT; the 3’ splice junction is in general AG. Refer to exon 1 and exon 2 of the ALDH10 gene in the previous slide: exon 1 is position ; exon 2 is . The position of the actual ATG is 1610 ( : is the 5’ UTR of exon 1; translation initiation site).

17 Splice site prediction
While GT and AG are the general 5’ and 3’ splice signals, such dinucleotide pairings are not uncommon elsewhere in a sequence: in fact there is a high degree of false positives (Understanding Bioinformatics p. 392). Figure 9.10 Understanding Bioinformatics: SpliceView… are prediction programs.
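To get a feel for the false positive rate, the sketch below lists every GT…AG span in a short invented sequence as a candidate intron. This is a naive enumeration for illustration only, not how SpliceView or any other prediction program works; candidate_introns and the toy sequence are assumptions.

```python
def candidate_introns(dna, min_len=4):
    """Return (start, end) for every span that begins with GT and ends with AG.

    'start' is the index of the G of GT, 'end' is the index just past the G of AG.
    min_len=4 only guarantees that the GT and the AG do not overlap.
    """
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]


# Invented toy gene: 6 bp exon, 12 bp intron (positions 6 to 18), 9 bp exon.
toy_gene = "ATGGCC" + "GTAAGTTAGCAG" + "AAGTTTTAA"

candidates = candidate_introns(toy_gene)
print(len(candidates), "candidate introns:", candidates)
print("true intron (6, 18) among them:", (6, 18) in candidates)
```

Even in 27 bp with a single real intron, several GT…AG spans qualify, which is why real predictors combine the dinucleotides with longer sequence context and other evidence.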

18 Proximity of promoters
Basically a true CDS (ORF) will have a promoter region nearby. Promoters in prokaryotes have well-defined bp sequences (motifs) upstream of the CDS (true ORF): the Pribnow box, TATAAT, at about position -10; TTGACA at about position -35; an AT-rich region before this box. Eukaryotic promoters are more complex; there is more than one…: TATA box, CAAT box, GC-rich regions. Conversely, the presence of an ORF indicates that there should be a promoter close by. (Bioinformatics 1 will cover promoter prediction in greater detail in the next lecture.)
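A very simple way to look for such motifs upstream of a candidate ORF is a sliding-window search that tolerates a few mismatches, since real promoters rarely match the consensus exactly. The sketch below is only an illustration: find_motif is a hypothetical helper, the upstream sequence is invented, and TTGACA is used as the commonly cited -35 consensus.

```python
def find_motif(dna, motif, max_mismatches=1):
    """Return (position, matched_window, mismatches) for every window of dna
    that differs from motif at no more than max_mismatches positions."""
    hits = []
    for i in range(len(dna) - len(motif) + 1):
        window = dna[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


# Invented upstream region: -35 box, 17 bp spacer, -10 box, then the start codon.
upstream = "GGC" + "TTGACA" + "CG" * 8 + "C" + "TATAAT" + "GCGCGC" + "ATG"

print(find_motif(upstream, "TTGACA", max_mismatches=0))
print(find_motif(upstream, "TATAAT", max_mismatches=0))
# A box with one substitution is still found when one mismatch is allowed.
print(find_motif("CCTATGATCC", "TATAAT", max_mismatches=1))
```

Allowing more mismatches finds weaker promoters but also more spurious hits, so in practice the spacing between the two boxes and their distance from the ORF would be checked as well.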

19 BP sequences in Exons/Introns
The DNA sequence of a gene’s CDS contains a different ratio of bases compared with the non-CDS parts of a gene or non-genic DNA. (The student is expected to research this.) So the ratio of bases to each other, and specific base sequences, differ between exons, introns and other non-coding DNA. (Remember that in non-CDS regions there are no codons.) If the student requires greater detail, supplementary material can be found in Zhang et al. and other references at the end of chapters 9 and 10 in Understanding Bioinformatics.
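One simple composition measure that can be compared between a candidate CDS and its surrounding sequence is GC content, both overall and separately for each of the three codon positions. The sketch below shows only the calculation; the function names are hypothetical, the input is a toy string, and which differences are actually informative is left to the reading suggested above.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def gc_by_codon_position(cds):
    """GC fraction at codon positions 1, 2 and 3, assuming cds starts in frame."""
    return tuple(gc_content(cds[pos::3]) for pos in range(3))


# Toy in-frame sequence (an assumption for illustration).
toy_cds = "ATGGCCAAG"
print("overall GC:", gc_content(toy_cds))
print("GC at codon positions 1, 2, 3:", gc_by_codon_position(toy_cds))
```

In a non-coding region there is no frame, so the three position-specific values have no particular reason to differ, whereas in a true CDS they may.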

20 Homology in coding regions
The CDS sequences of genes are generally highly conserved: hypothesise why this is the case. As in prokaryotic DNA, the CDS sequence is highly conserved, so database searches can facilitate determining exons and thus the ORF. A possible exon region can be extracted and submitted to a search for similar sequences (BLAST search) to see what it may reveal. If it is highly similar to existing exons then it is likely to be a true exon. An exon can also be translated and the translated sequence submitted to a search for homologs. (The Swiss-Prot BLAST search should be used as it contains experimentally determined AA sequences.)

21 Alternative splicing
The diagram shows the four main types of alternative splicing. It clearly indicates that the pre-mRNA is not the same as the mRNA (so direct translation from the DNA is fraught with danger). Homology analysis and the use of expressed sequence tags (mRNA produced by genes at different times and in different tissue types) can help determine the different splice variants. Can you think of any issues that may arise, using ORFs, if there is alternative splicing?

22 Reference
Baxevanis, A.D. Bioinformatics: a practical guide to the analysis of genes and proteins. Wiley; Chapter 5. [book is in the library]
Klug, W.A. et al. 2010; Concepts of Genetics; Pearson Education, pp. 596-597.
Zhang, M.Q. 2002. Computational prediction of eukaryotic coding genes. Nat Rev Genet.
Chasin, L.A. 2007. Searching for splicing motifs. Adv Exp Med Biol. 623:85-106.
Zvelebil, M. Understanding Bioinformatics, chapter 9. [book is in the library]

