The Toy Exon Finder.


1 The Toy Exon Finder

2 The “Toy” genome
- The genome is very dense with genes, typically with no more than 20 bp between successive genes on a chromosome
- The exons tend to be very small, on the order of 20 bp
- The introns in multi-exon genes tend to be very small, about 20 bp
- The exons incorporate fairly strong codon biases and a significant GC bias
- The splice sites, start codons, and stop codons are flanked by positions with fairly strong base composition biases
- The codon usage and base composition statistics can be well characterized with some sample data
- Genes occur only on one strand, which we will call the Forward strand

3 Toyscan 1: Use GC content to find exons
- Find all ORFs such that each ORF either
  - begins with a START and ends with a STOP,
  - begins with a START and ends with a GT,
  - begins with an AG and ends with a GT, or
  - begins with an AG and ends with a STOP
- Set a threshold t such that if an ORF has GC content below t, it is labeled as noncoding
- For all remaining pairs of ORFs p1, p2, do:
  - If p1 and p2 overlap, then discard the ORF with the lower GC content
- Output all ORFs that remain, calling them exons
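The filtering and overlap steps above can be sketched as follows. This is a minimal sketch, not the course's reference implementation: it assumes the candidate ORFs have already been enumerated and are passed in as half-open `(start, end)` intervals, and the names `gc_content`, `overlaps`, and `toyscan_1` are hypothetical. The slide does not say in which order overlapping pairs are compared, so the sketch assumes a greedy pass from highest to lowest GC content.

```python
from typing import List, Tuple

Orf = Tuple[int, int]  # half-open interval [start, end) on the forward strand


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def overlaps(a: Orf, b: Orf) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def toyscan_1(genome: str, orfs: List[Orf], t: float) -> List[Orf]:
    """Drop ORFs with GC content below t, then resolve overlaps by GC content."""
    scored = [(gc_content(genome[s:e]), (s, e)) for s, e in orfs]
    kept = [(gc, orf) for gc, orf in scored if gc >= t]
    # Assumption: visit candidates from highest to lowest GC content, so that
    # in any overlapping pair the ORF with the lower GC content is discarded.
    kept.sort(key=lambda item: item[0], reverse=True)
    exons: List[Orf] = []
    for _, orf in kept:
        if not any(overlaps(orf, chosen) for chosen in exons):
            exons.append(orf)
    return sorted(exons)
```

With this greedy order, an ORF that overlaps only an already discarded ORF is kept. A stricter reading of the slide, in which every overlapping pair is compared regardless of earlier discards, would remove more ORFs.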

4 Toyscan 2: Use codon bias to find exons
- Codon frequencies for “true” exons are assumed to be known
  - Stop codons are not included, so they have probability 0
- Define the codonBias function:
  - For an input ORF (given), score all 3 frames
  - Ignore the fact that some frames have stop codons in them
  - Score = sum of the log probabilities of all codons in that frame
    - Probabilities are taken from the “known” probabilities
  - Divide Score by the number of codons n; this normalizes it
  - Output the highest score of the 3 frames
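A minimal sketch of the codonBias function described above. The name `codon_bias` and the dictionary `codon_prob` (codon to probability) are hypothetical. Because stop codons have probability 0 and log 0 is undefined, the sketch assumes that "ignoring" stop codons means skipping any zero-probability codon and normalizing by the number of codons actually scored. Other readings, such as a small probability floor, are equally plausible.

```python
import math
from typing import Dict


def codon_bias(orf: str, codon_prob: Dict[str, float]) -> float:
    """Best average log probability per codon over the three reading frames."""
    orf = orf.upper()
    best = -math.inf
    for frame in range(3):
        # Complete codons only; a trailing partial codon is dropped.
        codons = [orf[i:i + 3] for i in range(frame, len(orf) - 2, 3)]
        # Assumption: codons with probability 0 (e.g. stop codons) are skipped
        # rather than scored as log(0).
        logs = [math.log(codon_prob[c]) for c in codons
                if codon_prob.get(c, 0.0) > 0.0]
        if logs:
            best = max(best, sum(logs) / len(logs))
    return best  # -inf if no frame contains a scorable codon
```

Dividing by the codon count makes scores comparable across ORFs of different lengths, which matters when a single threshold is applied to all of them.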

5 Toyscan 2 (cont.)
- Note: the codonBias score is the ORF's average log probability per codon under the “correct” distribution from real genes; for a given ORF, this average is largest when the distribution used for scoring matches the codon distribution observed within the ORF
- Define TOYSCAN_2 as:
  - A codon bias score threshold, t, is input
  - For all ORFs, score them with the codonBias function
  - If the score is < t, delete the ORF
  - For all remaining pairs of ORFs p1, p2, do:
    - If p1 and p2 overlap, then discard the ORF with the lower codonBias score
  - Output all remaining ORFs as exons
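One way to make the note about the maximum precise, assuming the score is exactly the normalized sum of log probabilities defined on the previous slide. The symbols are introduced here for illustration: c_1, ..., c_n are the codons of the best frame, p_c is the "known" probability of codon c, and f_c = n_c / n is the fraction of the frame's codons equal to c.

```latex
\[
\text{codonBias}
  = \frac{1}{n}\sum_{k=1}^{n} \log p_{c_k}
  = \sum_{c} f_c \log p_c
  = -H(f) - D_{\mathrm{KL}}(f \,\|\, p),
\]
\[
H(f) = -\sum_{c} f_c \log f_c,
\qquad
D_{\mathrm{KL}}(f \,\|\, p) = \sum_{c} f_c \log \frac{f_c}{p_c} \;\ge\; 0 .
\]
```

Since the divergence term is zero only when p = f, the score of a fixed ORF is bounded above by -H(f), and the bound is reached when the scoring distribution equals the ORF's own codon distribution. For a fixed p, by contrast, the score is largest for an ORF made up entirely of the single most probable codon, so a high score alone does not guarantee a good match.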

6 Toyscan 3: Use codon bias and weight matrix models (WMMs)
- Input includes WMMs for start, stop, donor, and acceptor sites
  - Donor WMM includes 5 positions after GT
  - Acceptor WMM includes 5 positions before AG
  - Start codon WMM includes 5 positions before ATG
  - Stop codon WMM includes 5 positions after TAA/TAG/TGA

7 Toyscan 3 (cont.)
- Score a weight matrix (scoreWMM):
  - For each position i in the sequence S, sum the log probabilities of the bases at positions i through j using the WMM, where j - i + 1 is the width of the WMM
- Score an ORF (scoreORF):
  - Choose the matrices to use on the left and right ends of the ORF
    - E.g., an internal exon has an acceptor on the left and a donor on the right
  - Score = WMM(left end) + WMM(right end) + codonBias
  - Return Score
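The two scoring procedures above can be sketched as follows. This is a sketch under stated assumptions: a WMM is represented as a list with one base-to-probability dictionary per position, the caller supplies the window start for each end of the ORF (the slides do not pin down exactly where each window sits relative to the signal), and the codon bias score is passed in as a precomputed number. The names `Wmm`, `score_wmm`, and `score_orf` are hypothetical.

```python
import math
from typing import Dict, List

Wmm = List[Dict[str, float]]  # one base -> probability table per position


def score_wmm(seq: str, i: int, wmm: Wmm) -> float:
    """Sum of log probabilities of the window seq[i..j] under the WMM."""
    j = i + len(wmm) - 1  # last position covered, so j - i + 1 == width
    if i < 0 or j >= len(seq):
        return -math.inf  # assumption: a window that runs off the sequence cannot match
    total = 0.0
    for offset, column in enumerate(wmm):
        p = column.get(seq[i + offset].upper(), 0.0)
        if p <= 0.0:
            return -math.inf  # a base the matrix forbids at this position
        total += math.log(p)
    return total


def score_orf(seq: str, left_start: int, left_wmm: Wmm,
              right_start: int, right_wmm: Wmm,
              codon_bias_score: float) -> float:
    """Score = WMM(left end) + WMM(right end) + codonBias."""
    return (score_wmm(seq, left_start, left_wmm)
            + score_wmm(seq, right_start, right_wmm)
            + codon_bias_score)
```

For an internal exon, `left_wmm` would be the acceptor matrix and `right_wmm` the donor matrix. For an initial exon, the start codon matrix would go on the left.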

8 Toyscan 3 (cont.)
- Now define Toyscan_3 as:
  - Assume a scoring threshold, t, is provided
    - You will have to experiment to find a good value for t
  - Get all ORFs
  - Score all ORFs using the scoreORF procedure
  - If the score is < t, delete the ORF
  - For all remaining pairs of ORFs p1, p2, do:
    - If p1 and p2 overlap, then discard the ORF with the lower scoreORF score
  - Output all remaining ORFs as exons
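One possible way to experiment with t is to sweep a few candidate values against sample data with known exons. The sketch below is illustrative only. It assumes each ORF has already been scored, that known exon coordinates are available as a set, and that the overlap step is left out so the effect of the threshold alone is visible. The name `sweep_thresholds` and its inputs are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Orf = Tuple[int, int]  # (start, end) coordinates of a candidate ORF


def sweep_thresholds(scores: Dict[Orf, float],
                     known_exons: Set[Orf],
                     thresholds: Iterable[float]) -> List[Tuple[float, int, int]]:
    """For each t, return (t, kept ORFs that are known exons, kept ORFs that are not)."""
    rows = []
    for t in thresholds:
        # An ORF survives when its score is not below the threshold.
        kept = {orf for orf, score in scores.items() if score >= t}
        hits = len(kept & known_exons)
        rows.append((t, hits, len(kept) - hits))
    return rows
```

A threshold that keeps most known exons while admitting few other ORFs is a reasonable starting point, which can then be checked again with the overlap step included.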

9 GFF format
# coding GC: 49%
# noncoding GC: 50%
1 toy-genome initial-exon transgrp=1;
1 toy-genome final-exon transgrp=1;
1 toy-genome single-exon transgrp=2;
1 toy-genome single-exon transgrp=3;
1 toy-genome single-exon transgrp=4;
1 toy-genome single-exon transgrp=5;
1 toy-genome single-exon transgrp=6;
1 toy-genome single-exon transgrp=7;
1 toy-genome single-exon transgrp=8;
1 toy-genome initial-exon transgrp=9;
1 toy-genome internal-exon transgrp=9;
1 toy-genome final-exon transgrp=9;
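A minimal sketch of writing one predicted exon as a GFF record. It assumes the output follows the standard nine-column, tab-separated GFF layout (seqname, source, feature, start, end, score, strand, frame, attributes). The records above show only the seqname, source, feature, and `transgrp` attribute, so the remaining columns and the coordinates in the test values are hypothetical. The function name `gff_line` is also hypothetical.

```python
def gff_line(seqname: str, source: str, feature: str,
             start: int, end: int, transgrp: int,
             score: str = ".", strand: str = "+", frame: str = ".") -> str:
    """Format one exon as a tab-separated, nine-column GFF record."""
    # Exons that belong to the same gene share a transgrp value.
    attributes = f"transgrp={transgrp};"
    fields = [seqname, source, feature, str(start), str(end),
              score, strand, frame, attributes]
    return "\t".join(fields)
```

The strand defaults to "+" because genes in the toy genome occur only on the forward strand. GFF coordinates are 1-based and inclusive, so half-open 0-based intervals need converting before they are written.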

