Lecture 12 Splicing and gene prediction in eukaryotes
1 Lecture 12 Splicing and gene prediction in eukaryotes
Bioinformatics
Critical splice signals
Coding statistics: DNA differences between exons and introns
Discriminant function and combined approach
2 Splicing and gene prediction in eukaryotes
Gene prediction of any type, and ab initio prediction in particular, is greatly complicated in eukaryotes by splicing.
The task is twofold: to predict the positions of exon-intron boundaries in genes that have multiple introns, and to predict the absence of introns in intronless genes.
Eukaryotic genomes differ significantly in a number of ways, which calls for species-specific prediction programs.
The major differences include: a) variation in GC content (e.g. mammalian genomes show large regional variation in GC content, referred to as isochores), b) variation in codon usage frequencies.
If these factors are not taken into account, the quality of prediction suffers.
3 AT/GC ratios in coding regions in some eukaryotes
4 The number of correct and incorrect (in parentheses) whole-gene model predictions shared among the 3 programs, from a test set of 1783 genes
Programs: GeneMark.hmm (GM), Genscan+ (GS), GlimmerM (GA)
An incorrect gene is a case in which all coding exons in the gene are in perfect agreement among the gene finders but not with the true gene.
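6 Critical splice signals
[Figure: EXON 1 - INTRON - EXON 2, marking the donor site (5' splice junction), the branch site, and the acceptor site (3' splice junction). Consensus residues shown along the intron, 5' to 3': G U A/G A G U U U A/G A U/C U/C A G. Conservation values shown: 100%, 62-68%, 100%. Additional residue labels: A G, G/A.]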
7 Frequencies of nucleotides at the ends of exons
[Figure: two panels, the first 10 nucleotides of exons (5' end) and the last 10 nucleotides of exons (3' end), for C. elegans, D. melanogaster and H. sapiens.]
8 Recognition of variable splice sites and gene prediction
At least 3 critical signals/motifs (donor, acceptor and branch sites) must be recognised in order to predict the position of an intron and both of its splice junctions.
Significant sequence variation in these sites, between species and between genes, lowers the quality of predictions.
The best average error rate (false positives + false negatives) for either donor or acceptor site prediction is about 5%. This may be acceptable if the search is restricted to a short region. However, searching a large region leads to an unacceptable number of false positives, because for every true site there are hundreds of pseudo-sites.
For example, if a large region has 40 true sites and 4000 pseudo-sites, one true site would be missed (2.5% false negatives) and 100 pseudo-sites would be predicted as true sites (2.5% false positives), so false predictions would outnumber the 39 correct ones by more than two to one.
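The arithmetic in this example can be checked with a short sketch. The function name is hypothetical, and the even split of the 5% error into 2.5% false negatives and 2.5% false positives simply follows the slide's example; it is not a property of any particular predictor.

```python
def splice_site_errors(true_sites, pseudo_sites, fn_rate, fp_rate):
    """Expected missed true sites, falsely called pseudo-sites, and precision.

    fn_rate: fraction of true sites that are missed.
    fp_rate: fraction of pseudo-sites that are called true.
    """
    missed = true_sites * fn_rate
    false_hits = pseudo_sites * fp_rate
    true_hits = true_sites - missed
    # Fraction of all predicted sites that are real sites.
    precision = true_hits / (true_hits + false_hits)
    return missed, false_hits, precision


# Numbers from the slide: 40 true sites, 4000 pseudo-sites, 2.5% + 2.5% error.
missed, false_hits, precision = splice_site_errors(40, 4000, 0.025, 0.025)
print(f"missed true sites: {missed:.0f}")
print(f"pseudo-sites called true: {false_hits:.0f}")
print(f"fraction of predictions that are real: {precision:.2f}")
```

With these inputs only 39 of the 139 predicted sites are real, which is why a per-site error rate that looks small is still a problem over long genomic regions.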
9 Recognition of variable splice sites and gene prediction
Since an adjacent donor site and acceptor site are not independent, this correlation can be exploited to eliminate further false positives.
For short introns, which occur mostly in lower eukaryotes, an intron is recognised through the interaction of splicing factors bound across the ends of the intron (hence the 5'ss - 3'ss correlation).
In vertebrates, exons are much shorter than introns, and the key is recognition of exons through the interaction of splicing factors bound across the ends of the exon (hence the 3'ss - 5'ss correlation).
Therefore functional splice sites in mammals can only be identified effectively together, through exon recognition.
There are also several additional signals/motifs essential for correct splicing, which are recognised by certain proteins involved in splicing. Identifying such sites and using them in prediction programs should increase the quality of eukaryotic gene predictions.
10 Coding statistics: DNA differences between exons and introns
Apart from splicing signals and ORFs, several additional characteristics may help to discriminate between exons and introns.
These features include DNA periodicity in exons, codon preferences, hexamer usage, codon prototype, and compositional bias between codon positions.
13 Periodic structure in DNA sequences
The absolute frequency of the AA pair with k (0 to 5) nucleotides between the two A's, in the first 200 base pairs of the sequences in a set of 1761 human exons and 1753 human introns. A clear period-3 pattern appears in coding regions and is absent in non-coding regions. A similar periodic pattern appears in coding regions for the other fifteen possible pairs of nucleotides.
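The counting behind this figure can be sketched as follows. The function name and the toy sequence are assumptions made for illustration: the sequence is an artificial repeat, not one of the human exons or introns from the slide, so it only shows how the counts are obtained.

```python
from collections import Counter


def pair_distance_counts(seq, first="A", second="A", max_gap=5):
    """Count occurrences of `first` ... `second` with k bases in between.

    Returns a list indexed by k = 0 .. max_gap.
    """
    seq = seq.upper()
    counts = Counter()
    for k in range(max_gap + 1):
        step = k + 1  # offset between the two bases of the pair
        for pos in range(len(seq) - step):
            if seq[pos] == first and seq[pos + step] == second:
                counts[k] += 1
    return [counts[k] for k in range(max_gap + 1)]


# Toy, perfectly periodic sequence (hypothetical data, not a real exon).
toy = "AGC" * 10
print(pair_distance_counts(toy))  # [0, 0, 9, 0, 0, 8]
```

In the toy sequence the AA pair occurs only with 2 or 5 bases in between, that is, at offsets of 3 and 6. Real coding sequence would show a weaker version of the same period-3 signal.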
14 Codon Preference
A coding statistic was introduced to measure solely the uneven usage of synonymous codons.
From a codon usage table, we can compute the relative probability of each synonymous codon coding for a given amino acid.
For instance, GAG and GAA, the two codons coding for glutamic acid, are used in coding regions with unequal probabilities (the two values did not survive in this transcript), which gives relative probabilities of 0.59 and 0.41, respectively.
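A minimal sketch of the relative-probability calculation is given below. The function name is hypothetical, and the counts are placeholder values chosen only so that the result matches the 0.59 / 0.41 split quoted on the slide; they are not the usage probabilities from the original codon usage table.

```python
def relative_synonymous_probs(codon_usage, synonymous_codons):
    """Normalise usage values over the codons that encode one amino acid."""
    total = sum(codon_usage[c] for c in synonymous_codons)
    return {c: codon_usage[c] / total for c in synonymous_codons}


# Placeholder counts (assumed, NOT the values from the slide's source table).
toy_usage = {"GAG": 59, "GAA": 41}
probs = relative_synonymous_probs(toy_usage, ["GAG", "GAA"])
print(probs)
```

Because the values are normalised within each amino acid, the statistic reflects only the choice among synonymous codons and not how often the amino acid itself is used.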
15 Hexamer usage correlation
Bias in the distribution of oligonucleotides longer than codons can also be used to discriminate between coding and non-coding regions. Bias in the usage of hexamers may be the most discriminating (probably because of the dependence between adjacent amino acids in proteins). Bias in hexamer usage can be computed in the same way as bias in codon usage, since background codon frequencies are known and the frequency of each of the 64^2 = 4096 hexamers can be estimated.
There are several ways to construct a frame-specific hexamer score, including the log-odds score L_E(w,i) = log[f_E(w,i) / f_I(w)] and the preference score P_E(w,i) = f_E(w,i) / [f_E(w,i) + f_I(w)], where f_E(w,i) is the frequency of hexamer w in frame i, calculated from known exon training data, and f_I(w) is the frequency of w in known introns.
[Table: probabilities of the four nucleotides (A, C, G, T) at each codon position (1, 2, 3), conditioned on the nucleotide in the preceding codon position, estimated from a set of human exon and intron sequences. Values not preserved in the transcript.]
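The two hexamer scores can be sketched in a few lines. Everything beyond the two formulas is an assumption made for illustration: the helper names, the convention that training exons start at codon position 1 (frame 0), the add-one pseudocount used to avoid dividing by zero, and the toy training sequences, which are not real exon or intron data.

```python
import math
from collections import Counter

N_HEXAMERS = 4 ** 6  # 4096 possible hexamers


def hexamer_counts(seqs, frame=None):
    """Count hexamers; if frame is 0, 1 or 2, keep only starts with pos % 3 == frame."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for pos in range(len(s) - 5):
            if frame is None or pos % 3 == frame:
                counts[s[pos:pos + 6]] += 1
    return counts


def freq(counts, w, pseudo=1.0):
    """Relative frequency of hexamer w with a pseudocount (an assumption)."""
    total = sum(counts.values()) + pseudo * N_HEXAMERS
    return (counts[w] + pseudo) / total


def log_odds(w, exon_counts, intron_counts):
    """L_E(w,i) = log[f_E(w,i) / f_I(w)]."""
    return math.log(freq(exon_counts, w) / freq(intron_counts, w))


def preference(w, exon_counts, intron_counts):
    """P_E(w,i) = f_E(w,i) / [f_E(w,i) + f_I(w)]."""
    fe, fi = freq(exon_counts, w), freq(intron_counts, w)
    return fe / (fe + fi)


# Toy training sets (hypothetical): exons assumed to start in frame 0.
exon_counts = hexamer_counts(["ATGGCC" * 5], frame=0)
intron_counts = hexamer_counts(["ATATAT" * 5])

for w in ("ATGGCC", "ATATAT"):
    print(w, round(log_odds(w, exon_counts, intron_counts), 3),
          round(preference(w, exon_counts, intron_counts), 3))
```

The two scores carry the same information, since P_E = 1 / (1 + exp(-L_E)); the log-odds form is convenient because scores of successive hexamers in a window can be summed.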
16 Codon Prototype, Markov model measure and Average Mutual Information
A measure can be introduced that shows how similar the observed distribution of base frequencies at the three codon positions in a sequence (exon or intron) is to the prototypical distribution (see the table).
Dependencies between nucleotide positions in coding regions can be described explicitly by means of Markov models.
Average Mutual Information quantifies the correlation between nucleotides a fixed distance apart, based on the probability of finding the pair of nucleotides i and j at a distance of k nucleotides in the sequence.

Nucleotide   Codon position
             1      2      3
A            0.27   0.31   0.18
C            0.24   -      -
G            0.32   0.20   0.29
T            0.17   0.26   0.22
(- = value not preserved in the transcript)
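As a hedged illustration of the last measure, the sketch below uses the standard definition of mutual information between positions a distance k apart, I(k) = sum over i,j of p_ij(k) log2[p_ij(k) / (p_i p_j)]. The slide does not give the formula, so treating this as the intended definition is an assumption, as are the function name and the toy sequences.

```python
import math
from collections import Counter


def average_mutual_information(seq, k):
    """Mutual information (in bits) between bases separated by k positions."""
    seq = seq.upper()
    pairs = Counter((seq[p], seq[p + k]) for p in range(len(seq) - k))
    n = sum(pairs.values())
    # Marginal counts of the first and second member of each pair.
    left, right = Counter(), Counter()
    for (a, b), c in pairs.items():
        left[a] += c
        right[b] += c
    ami = 0.0
    for (a, b), c in pairs.items():
        p_ab = c / n
        ami += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return ami


# Toy sequences (hypothetical): a perfect repeat and a homopolymer.
print(average_mutual_information("ACG" * 20, 3))
print(average_mutual_information("A" * 30, 2))
```

On real data one would compare I(k) across a range of k; a value that is higher at k = 3, 6, 9 than at neighbouring distances would be the AMI counterpart of the period-3 pattern described earlier.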
17 Values of different coding statistics in the 223 bp long 2nd coding exon of the human β-globin gene, and in a 223 bp long sequence from the middle of the 2nd intron of the same gene

                              Exon sequence                Intron sequence
Statistic                     Coding    Non-coding frames  Frame 1  Frame 2  Frame 3
Codon Usage                    24.06   -16.13    -3.16     -14.36   -23.74   -19.67
Hexamer Usage                  27.62   -11.64    -6.51     -20.90   -27.56   -22.07
(label not preserved)          39.98   -14.58    -8.46     -26.73   -27.81   -25.87
Codon Preference               15.97    -1.32     7.24      -7.96   -12.70   -14.93
Amino Acid Usage                8.17   -14.87   -10.17      -6.15   -10.69    -4.57
Codon Prototype                 9.87   -11.23   -10.30     -11.45   -17.44   -14.49
Markov Model, order 1          29.92    -2.69    -3.31     -35.44   -42.40   -41.73
Markov Model, order 2          34.73   -18.26    -7.77     -29.61   -41.76   -40.05
Markov Model, order 5          72.69   -21.38    13.56     -37.63   -30.99   -36.40

Frame-independent statistic   Exon sequence   Intron sequence
Position Asymmetry            0.0957          0.0211
Periodic Asymmetry Index      1.159           1.009
Average Mutual Information    -               -
Fourier Spectrum              2.278           0.892
(- = value not preserved in the transcript)
18 Pattern discriminant analysis
A number of different pattern features of sequences are used to discriminate coding (exon) from non-coding sequences. Linear and quadratic discriminant analyses are shown, the latter being the more effective. EPS (the 6-mer exon preference score) and 3'SS (the 3' splice site) serve as example features.
19 COMBINER: computational gene prediction using multiple sources of evidence
The next generation of computational methods for constructing gene models is currently being developed. These methods take as input a genomic sequence together with the locations of gene predictions from ab initio gene finders, protein sequence alignments, expressed sequence tag (EST) and cDNA alignments, splice site predictions, and other evidence, and combine them.
An example of such a program is COMBINER, which uses rigorous statistical assessment to evaluate candidate gene models and estimates probabilities using so-called decision trees.