Table 1. Assessment of EuGene prediction results actual genes correct gene models partial gene models split genes missing genes actual exons missing exons.


Design and Databasing for the Complete Arabidopsis Transcriptome MicroArray (CATMA) project

C. Serizet, V. Thareau, S. Aubourg, M. Crowe, P. Hilson and P. Rouzé

Introduction

Now that the Arabidopsis genome has been fully sequenced [1], biologists would like to determine where the genes lie in this sequence, through annotation, and what roles they play, through functional genomics. Microarrays are the tools of choice for analysing the transcript profiles of genes across whole genomes, including that of Arabidopsis. The principle of such experiments is to spot DNA on a support (one spot per gene) and to hybridize it with labelled DNA copies of the transcript population extracted from the organism under each of the conditions studied. This makes it possible to analyse the expression of thousands of genes simultaneously.

Currently, the spotted DNA comes from cDNAs and ESTs. The first limitation of EST-based microarrays is access to, and maintenance of, such resources, which in any case represent only a fraction of the whole genome. The second, and perhaps more important, limitation is that most Arabidopsis genes belong to gene families (sequence-related genes), so using ESTs leads to cross-hybridization with other members of the family.

The CATMA project was initiated to bypass these drawbacks. Its aim is to generate a collection of specific Gene Sequence Tags (GSTs) covering every Arabidopsis gene. GSTs are amplicons of individual genes (150 bp to 500 bp long), obtained from genomic DNA over transcribed regions selected for their uniqueness in the genome. These GSTs are now being produced and will be used for transcript profiling experiments on microarrays by the eight European partners of the CATMA consortium. They can also be used for other experiments based on nucleic acid hybridization, such as RNAi.

Structural annotation of Arabidopsis genome

As only a small fraction of Arabidopsis genes are experimentally documented, genome annotation still relies on gene prediction to identify the exons and borders of the genes. The AGI annotation of the Arabidopsis genome is publicly available [1]. This annotation was done quickly by different centers using different processes, which resulted in errors and heterogeneity [2]. Consequently, we have produced a new annotation of the Arabidopsis genome using EuGene (Figure 1) [3], a software package shown to be very efficient at exon finding and gene modeling in Arabidopsis with a previously described validation (Table 1) [4]. We thereby identified more than 29000 genes, compared with the 25470 identified by the AGI (Figure 2). As a large amount of new expression data was recently released [5], we are currently updating the structural annotation.

Automated design of GSTs

The collection of specific Gene Sequence Tags (GSTs) was designed using the SPADS software [6], which generates gene-specific probes. Briefly, the software proceeds as follows (Figure 3):

I) Search for the most specific region within each gene. Each exon is tested with BLASTn against the whole genome sequence ("GST specificity database" in Figure 3), and fragments with hits are removed. Primer pairs are designed in the remaining regions. If none are detected, the mismatch parameter of BLASTn is decreased and only fragments with stringent hits are subtracted, thus enlarging the specific regions that remain for primer design.

II) The specific regions are used as input for the Primer3 software, which designs the two primers for each GST.

III) These primers are tested for specificity with BLASTn against the template DNA (e.g. BACs) containing the gene, and are excluded if matches indicate potential unwanted PCR amplification.

IV) Each successive amplicon is tested with BLASTn to determine its specificity. If the identity with a putative paralogous sequence is over 70%, the amplicon is removed and the next one is processed. Specific GSTs are searched from 3' to 5' until one is found.

So far, a specific GST has been designed for 75% of the genes (see Figure 4 for their characteristics). A second run is under way to cover the remaining genes.

The CATMA database

A database containing the data from the CATMA project has been built in MySQL and is maintained at the John Innes Centre. It contains all the information on the GSTs and their associated genes (with links to EuGene and AGI entries), such as genomic and transcript sequence data, chromosomal location, gene structure and primer sequences (Figure 5). A web interface allows searching by gene name, GST name or spot coordinate on the microarray. Genes and GSTs can also be identified by homology searches. More advanced searches can be made with SQL queries through the web. The database and other information about the synthesis of the GSTs will be made available at http://www.catma.org on June 21st, 2002.

References

[1] The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796-815.
[2] Aubourg S and Rouzé P (2001) Genome annotation. Plant Physiol. Biochem. 39, 181-193.
[3] Schiex T, Moisan A and Rouzé P (2001) EuGène: an eukaryotic gene finder that combines several sources of evidence. In O. Gascuel, M.F. Sagot (Eds.): JOBIM 2000, Montpellier, 111-125.
[4] Pavy N, Rombauts S, Déhais P, Mathé C, Ramana DV, Leroy P and Rouzé P (1999) Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics 15, 887-899.
[5] Seki M et al (2002) Functional annotation of a full-length Arabidopsis cDNA collection. Science 296, 141-145.
[6] Thareau V, Déhais P, Rouzé P and Aubourg S (2001) Automatic design of gene specific tags for transcriptome studies. In L. Duret, C. Gaspin, T. Schiex (Eds.): JOBIM 2001, Toulouse, 195-196.

Table 1. Assessment of EuGene prediction results

                          PlantGene       Araset
actual genes              238             51
correct gene models       182 (76%)       37 (67%)
partial gene models       50 (21%)        14 (27%)
split genes               5 (2%)          0
missing genes             1 (0.5%)        0
actual exons              1639            254
missing exons             51 (3%)         15 (6%)
missing exons in 5'       33 (2%)         8 (3%)
missing exons central     12 (0.7%)       5 (2%)
missing exons in 3'       6 (0.4%)        2 (0.8%)
wrong exons               1 (0.06%)       1 (0.4%)

Figure 1. Gene identification using EuGene
[Diagram labels: genomic fragment; RepeatMasker; Blastn; Blastx; NetStart; NetGene2; SplicePredictor; EuGène; genes]

Figure 2. Gene density according to the EuGene and AGI annotations

Figure 3. GST design using SPADS
[Flowchart labels: Input: sequence + exon coordinates; Define specific region (BLASTn on exons, low stringency, then higher stringency); Primer3; Testing primer pairs specificity (BLASTn); Checking amplicon specificity (BLASTn); outcomes: High specific GST, Medium specific GST, No specific GST found]

Figure 4. GST characteristics
A. Distribution of GST lengths
150-200 bp: 42%
200-300 bp: 36%
300-500 bp: 22%
B. Position of GSTs (along the transcript, from ATG to stop; regions labelled UTR and CDS)
5': 3267 (16%)
center: 5115 (24%)
3': 12701 (60%)

Figure 5. Diagram of the CATMA database
(Field markers # and @# are reproduced from the diagram.)
complete_sequence: #id, actual_sequence, sequence_length, gst_location, gst_type, gst_gc, gst_intron, gst_homology, gst_start, gst_stop, 96_plate_code, 96_coords, 384_plate_code, 384_coords, gene_name, gene_sequence, chromosome, chr_start, chr_stop, amplification_results, sequence_verified, model_type, agi_match, agi_function, @#gel_id, gel_coords, @#primer5_id, @#primer3_id, @#contact_id, @#bac_id, embl_id, comments, short_comments, design bugs
contact: #id, last_name, first_name, type, lab, department, organization, street, city, province_state, country, postal_code, phone, fax, email
primer: #id, sequence, length, gc, tm, start, stop, @#extension_id
extension: #id, sequence
gel: #id, image
bac: #id, chromosome, start, stop, length, embl_id
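
To illustrate the kind of SQL search the database section mentions, here is a minimal sketch built on a small subset of the Figure 5 fields. It rests on several assumptions: it uses Python's built-in sqlite3 module in place of the MySQL server the project actually runs, it reads the "#" and "@#" markers in the diagram as primary and foreign keys, and the gene name and primer sequences are invented placeholders rather than CATMA data. The function names are hypothetical.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a reduced version of two Figure 5 tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE primer (
            id       INTEGER PRIMARY KEY,   -- '#id' in the diagram (assumed primary key)
            sequence TEXT NOT NULL
        );
        CREATE TABLE complete_sequence (
            id         INTEGER PRIMARY KEY,
            gene_name  TEXT NOT NULL,
            chromosome INTEGER,
            gst_start  INTEGER,
            gst_stop   INTEGER,
            primer5_id INTEGER REFERENCES primer(id),  -- '@#primer5_id' (assumed foreign key)
            primer3_id INTEGER REFERENCES primer(id)   -- '@#primer3_id' (assumed foreign key)
        );
        """
    )
    # Placeholder rows: invented values, not real CATMA records.
    conn.executemany(
        "INSERT INTO primer (id, sequence) VALUES (?, ?)",
        [(1, "ACGTACGTACGTACGTACGT"), (2, "TGCATGCATGCATGCATGCA")],
    )
    conn.execute(
        "INSERT INTO complete_sequence "
        "(id, gene_name, chromosome, gst_start, gst_stop, primer5_id, primer3_id) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (1, "GENE_A", 1, 1200, 1450, 1, 2),
    )
    conn.commit()
    return conn


def find_gsts_by_gene(conn: sqlite3.Connection, gene_name: str) -> list:
    """Return (GST id, start, stop, 5' primer, 3' primer) tuples for one gene name."""
    query = """
        SELECT cs.id, cs.gst_start, cs.gst_stop, p5.sequence, p3.sequence
        FROM complete_sequence AS cs
        JOIN primer AS p5 ON p5.id = cs.primer5_id
        JOIN primer AS p3 ON p3.id = cs.primer3_id
        WHERE cs.gene_name = ?
        ORDER BY cs.id
    """
    # The '?' placeholder keeps user-supplied gene names out of the SQL text.
    return conn.execute(query, (gene_name,)).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for row in find_gsts_by_gene(db, "GENE_A"):
        print(row)
```

Joining the primer table twice, once per primer role, mirrors the two separate primer references in complete_sequence; a real deployment would also have to quote field names that begin with a digit, such as 96_plate_code.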
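
The amplicon selection rule in step IV of the GST design procedure (discard an amplicon whose identity with a putative paralog is over 70%, and search from 3' to 5' until one passes) can be sketched as below. This is not the SPADS code: the Amplicon class, the pick_gst function and the idea of passing in a precomputed best-paralog identity are all assumptions made for illustration, and the BLASTn and Primer3 steps that would produce the candidates are left out. The 150 to 500 bp length window is taken from the introduction.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

IDENTITY_CUTOFF = 70.0  # percent; amplicons "over 70%" identity are removed
MIN_LENGTH = 150        # bp, lower bound of GST length given in the text
MAX_LENGTH = 500        # bp, upper bound of GST length given in the text


@dataclass(frozen=True)
class Amplicon:
    """Hypothetical candidate amplicon, in transcript coordinates (5' to 3')."""
    start: int
    stop: int
    best_paralog_identity: float  # percent identity of the best non-self hit (assumed input)

    @property
    def length(self) -> int:
        return self.stop - self.start + 1


def pick_gst(candidates: Sequence[Amplicon],
             cutoff: float = IDENTITY_CUTOFF) -> Optional[Amplicon]:
    """Return the 3'-most candidate that passes the length and identity filters."""
    # Sorting by stop position in descending order walks from the 3' end to the 5' end.
    for amplicon in sorted(candidates, key=lambda a: a.stop, reverse=True):
        if not MIN_LENGTH <= amplicon.length <= MAX_LENGTH:
            continue  # outside the GST size range
        if amplicon.best_paralog_identity > cutoff:
            continue  # too similar to a putative paralog, try the next one
        return amplicon
    return None  # corresponds to "No specific GST found" in Figure 3


if __name__ == "__main__":
    demo = [
        Amplicon(100, 300, 50.0),
        Amplicon(400, 650, 85.0),
        Amplicon(700, 900, 72.0),
    ]
    print(pick_gst(demo))
```

Returning None instead of raising keeps the "no specific GST" outcome explicit, which matches the second design run the text describes for genes that were not covered in the first pass.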

