ANEXdb: An Integrated Animal ANnotation and Microarray EXpression Database


Oliver Couture 1,2, Keith Callenberg 2,3#, Neeraj Koul 4, Sushain Pandit 4, Remy Younes 4, Zhi-Liang Hu 2, Jack Dekkers 1,2, James Reecy 1,2, Vasant Honavar 4, Christopher Tuggle 1,2

1 Interdepartmental Genetics - Iowa State University, Ames, Iowa 50011, USA
2 Department of Animal Science - Iowa State University, Ames, Iowa 50011, USA
3 Department of Computer Science - San Jose State, San Jose, California 95192, USA
4 Department of Computer Science - Iowa State University, Ames, Iowa 50011, USA
# Joint Carnegie Mellon University, University of Pittsburgh Program in Computational Biology - Pittsburgh, PA 15260, USA

Introduction

Microarray and other high-throughput assays generate massive amounts of data.
Some species, such as the pig, have little direct sequence annotation.
This impairs the full utilization of some high-throughput methods.
We can leverage the direct annotations of more established species, such as human or mouse, to assist in annotating the poorly annotated species.
Sequence homology can map homologs together.
For more accurate homology alignments, we need full-length sequences.
Most sequence data are short EST reads, with few full-length clones.
We need to assemble EST sequences to get more complete cDNA sequences.
Assembled sequences can be mapped to assay elements.
This allows transfer of functional information from well-annotated homologs to array elements, for a fuller understanding of different experimental conditions.

ExpressDB and anexDBO

Currently supports Affymetrix GeneChips®.

Assembly Completeness and Annotation Validation

Compared coverage of the human and mouse RefSeq RNA databases by the Iowa Porcine Assembly (IPA) with the reciprocal coverage between the two species (human against mouse RefSeq RNA and mouse against human RefSeq RNA).
BLASTN with a cutoff of 1e-10.
The IPA covers nearly as much of the human RefSeq database as mouse does (72.4% vs. 73.4%) and covers the mouse RefSeq database almost as well as human does (67.5% vs. 73.5%).
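The coverage comparison above can be illustrated with a small sketch. It assumes, since the poster does not say so, that "coverage" means the percentage of database sequences with at least one hit below the E-value cutoff, and that hits are available in the standard 12-column tabular BLAST format (query ID in column 1, subject ID in column 2, E-value in column 11). The function names and the sample IDs are hypothetical.

```python
import csv
import io

# Nucleotide E-value cutoff quoted on the poster.
EVALUE_CUTOFF = 1e-10


def covered_subjects(tabular_text, cutoff=EVALUE_CUTOFF):
    """Return the subject IDs that have at least one hit with E-value < cutoff.

    Assumes 12-column tab-delimited BLAST output:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    covered = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        subject_id, evalue = row[1], float(row[10])
        if evalue < cutoff:
            covered.add(subject_id)
    return covered


def coverage_percent(tabular_text, database_size, cutoff=EVALUE_CUTOFF):
    """Percentage of database sequences hit at least once below the cutoff."""
    return 100.0 * len(covered_subjects(tabular_text, cutoff)) / database_size


# Toy hits (hypothetical IDs): three pass the cutoff, one does not.
SAMPLE_HITS = "\n".join([
    "ITC1\tNM_0001\t95.0\t500\t25\t0\t1\t500\t1\t500\t1e-120\t800",
    "ITC2\tNM_0001\t90.0\t300\t30\t0\t1\t300\t10\t309\t1e-50\t400",
    "ITC3\tNM_0002\t80.0\t60\t12\t0\t1\t60\t5\t64\t1e-04\t45",
    "ITC4\tNM_0003\t88.0\t200\t24\t0\t1\t200\t1\t200\t3e-30\t250",
])

if __name__ == "__main__":
    # Toy database of 4 sequences, 2 of which are covered.
    print(coverage_percent(SAMPLE_HITS, database_size=4))
```

Counting distinct subjects rather than hits keeps a heavily hit reference sequence from inflating the figure; whether the poster's percentages were computed this way is not stated.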
Compared the Gene ID of the top human hit across both BLASTs and Exonerate.
81.7% of the BLASTN and BLASTX alignments had the same Gene ID for their top-scoring hit.
80.1% of the Exonerate hits agreed with either of the BLAST results, having the same Gene ID as the top human hit.

Schema of ANEXdb

[Diagram labels: User, Administrator, Submit, Query, Submission Temp, anexDBO, ExpressDB (Expression Data), AnnotDB (Sequence Information); PHP, Apache, MySQL, Java, Perl, R]
Arrows indicate the direction of the data flow.
Levels show which language or host controls the flow.
Users submit to the temp database via a web interface.
They can also define their submission as public or private.
ExpressDB is queried using a web interface.
AnnotDB is queried through either a web interface or a guest account, which gives direct access for more advanced queries.
The administrator uses a web-based interface, scripting, or direct access to migrate (Submission to ExpressDB) or upload (AnnotDB) data within ANEXdb.

anexDBO

Java object model of the database.
Migrates experimental information.
Integrates with Velocity to create output files.
Currently only SOFT for GEO submission.

ExpressDB

[Diagram labels: ExpressDB (Expression Data), Submission Temp, anexDBO, Admin PHP Interface, Bioconductor, .cel Files]
Ability to query via web.
SOFT file output via Velocity template.
Calculates MAS5 and RMA signals for storage.

AnnotDB Assembly Construction and Annotation

Source sequences: Trace Archive, 1,035,200 sequences; dbEST, 1,475,958 sequences; dbCore, 18,157 sequences.
SeqClean, screening against UniVec and LINE/SINE: 2,369,608 after removal of undesired sequence; 1,144,310 trimmed.
TGICL (tclust and sclust) and CAP3: 140,087 consensus sequences and 103,888 singletons.
Assembly labels: Iowa Porcine Assembly (IPA), Iowa Tentative Consensus (ITC).
Sequence variation.
BLAST: RefSeq RNA and Protein (used NM numbers to map to NCBI Gene ID), Pfam, and Affymetrix probe set target sequences. Cutoffs: nucleotide 1e-10, protein 1e-5.
Exonerate: used human chromosomes; e2g model, > 60% of seq in alignment, > 1 HSP with score > 100.
Functional data: Gene: GO and KEGG; Pfam: GO; Affy: Expression.

[Figure: overlap among BLASTN, BLASTX, and Exonerate hits. BLASTN total: 128,659; BLASTX total: 53,278; Exonerate total: 33,027. Region counts in extraction order (region labels not recoverable): 32,990; 128; 10,531; 9,629; 15,787; 69,351; 6,581]

SNP Validation

Used dbSNP at NCBI to verify SNPs predicted by mining the CAP3 output files for the ITCs.
Aligned ITCs to the submitted sequences of porcine SNPs in dbSNP.
Split dbSNP into three groups: total (34,508 SNPs), genomic (33,863), and cDNA (645). Tested all three.
Calculated the position of the reported SNP in the alignment.
Looked up the ITC in AnnotDB to determine whether a SNP was predicted at the same location.
4,244 (12.3%) of all porcine SNPs in dbSNP were found in the ITCs when requiring > 60% of the dbSNP sequence aligned to the ITC and at least 2 counts of the minor allele.
Of the cDNA-annotated SNPs, 161 (25.0%) are found at the same stringency.
At the least stringent setting, 303 (47.0%) are found in the ITCs.

Affymetrix GeneChip® Annotation

Affymetrix target sequences were aligned to the IPA, which allows transfer of IPA annotation to the Affymetrix probe set represented by the target sequence.
At an E-value < 1e-5 (min score of 50), 22,568 probe sets had a hit to the IPA; of these, 19,253 had a hit to a RefSeq with an E-value < 1e-10 (min score of 74).
In comparison, BLASTing the target sequences against RefSeq RNA at an E-value < 1e-10 resulted in 17,960 probe sets having an alignment to at least one RefSeq sequence.
The IPA also had a higher average score than the direct BLAST of the Affymetrix target sequences (1,244 vs. 392, respectively), indicating more reliable BLAST results.

Available Annotation in AnnotDB

Annotation | Total Number of Annotations | Total Number of ITCs with Annotation | Distinct Sequence Number | Distinct ITC Number
BLASTN to RefSeq RNA | 6,662,451 | 3,523,626 | 184,008 | 101,132
BLASTX to RefSeq Protein | 7,568,537 | 3,316,199 | 71,014 | 30,877
BLASTX to Pfam | 6,716,663 | 3,364,108 | 76,385 | 40,813
Exonerate | 71,116 | 33,016 | 34,573 | 22,083
Associated GO Terms | 26,234,123 | 7,721,429 | 236,269 | 86,337
Associated KEGG Pathways | 11,993,335 | 4,507,027 | 127,574 | 48,763
Putative Variation | 2,528,653 | | | 63,995
ORF | 1,200,483 | 723,047 | 227,954 | 127,978

Breakdown of the source of annotation and the number of that kind of annotation.
Almost every sequence (96.8%) has some form of GO annotation, and over half (52.3%) have an associated KEGG pathway, both from Gene IDs transferred from the BLAST to RefSeq.
ORFs were predicted using NCBI's ORF Finder program.

Variation Type by Minimum Number of Minor Allele Counts

Minimum Number of Minor Allele Counts | SNP | Deletion | Insertion | ITC Count
1 | 2,004,432 | 428,637 | 14,536 | 63,995
2 | 628,944 | 116,653 | 1,662 | 32,235
3 | 267,968 | 54,915 | 576 | 16,253
4 | 171,034 | 34,093 | 309 | 12,263
5 | 117,496 | 23,303 | 210 | 9,378
6 | 88,594 | 17,114 | 142 | 7,796
7 | 69,598 | 13,144 | 98 | 6,598
8 | 56,640 | 10,407 | 73 | 5,706
9 | 47,084 | 8,475 | 58 | 5,014
10 | 39,943 | 7,053 | 49 | 4,441

Variation broken down into type and minimum number of reads of the minor allele.
A deletion is when the nucleotide is deleted from the consensus sequence.
An insertion is when the consensus sequence has a base that some contributing sequences lack.

Discussion

While ExpressDB is currently set up for Affymetrix-style data, it can be customized to add additional platforms, such as two-color array data.
The Velocity template system can also be customized to output in a user-defined template.
This can include XML-based MINiML for ArrayExpress submission.
We found a lower percentage of overall singletons (4.3%) than the Dana Farber Gene Index (DFGI, 10.0%) and the Sino-Danish (SD, 7.2%) assemblies.
The DFGI has a similar percentage of singletons in its human assembly as in its pig assembly: 10.1%.
We have fewer singletons remaining than DFGI, but more than SD: 103,888 (AnnotDB), 133,182 (DFGI), 73,131 (SD).
This is likely due to the large number of starting sequences being able to bridge gaps.
The IPA is as complete as the human and mouse RefSeq databases, each measured relative to the other well-annotated species, though future assemblies will be useful to refine consensus sequences.
SNPs found in the ITCs can currently be mapped to the Affymetrix GeneChip®.
Non-synonymous or nonsense SNPs could help explain differences in gene expression level (eQTL studies); any SNP in a probe sequence could affect hybridization to the probe.
More of dbSNP was not found because not all variants were submitted outside of dbSNP.
Submitters can submit to dbSNP alone, so the variant sequence was not in one of the other sequence databases.
ITC SNPs with only one minor allele count could also be real.
If a variant was submitted outside of dbSNP, a single exemplar sequence was most likely submitted.
Through leveraging well-annotated species and using species-specific sequence variation, a deeper understanding of biological conditions can be developed.
Through ortholog mapping, cross-species comparisons of expression in pig will also benefit medical studies using the pig as a model, such as wound healing or obesity.

Acknowledgments

We would like to thank the USDA CSREES-NRI-2005-3560415618 and the ISU Center for Integrated Animal Genomics for funding this project. A USDA MGET 2001-52100-11506 Fellowship to O.C. is gratefully acknowledged.
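The SNP validation criteria described on the poster (more than 60% of the dbSNP sequence aligned to the ITC, at least 2 counts of the minor allele, and a predicted SNP at the same position) can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not the authors' code: the record layout, the names `DbSnpAlignment`, `is_confirmed`, and `percent_found`, and the toy IDs are all hypothetical.

```python
from dataclasses import dataclass

# Thresholds quoted on the poster.
MIN_ALIGNED_FRACTION = 0.60   # "> 60% of the dbSNP sequence aligned to the ITC"
MIN_MINOR_ALLELE_COUNT = 2    # "at least 2 counts of the minor allele"


@dataclass(frozen=True)
class DbSnpAlignment:
    """Hypothetical record: one dbSNP sequence aligned to one ITC."""
    snp_id: str
    itc_id: str
    aligned_fraction: float  # fraction of the dbSNP sequence inside the alignment
    itc_position: int        # reported SNP position projected onto the ITC


def is_confirmed(aln, predicted,
                 min_fraction=MIN_ALIGNED_FRACTION,
                 min_count=MIN_MINOR_ALLELE_COUNT):
    """True if a predicted ITC SNP matches the dbSNP record at the same position.

    `predicted` maps (itc_id, position) to the minor allele count of the
    variation predicted from the assembly (an assumed lookup structure).
    """
    if aln.aligned_fraction <= min_fraction:
        return False
    return predicted.get((aln.itc_id, aln.itc_position), 0) >= min_count


def percent_found(found, total):
    """Percentage rounded to one decimal place, as reported on the poster."""
    return round(100.0 * found / total, 1)


# Toy data with hypothetical IDs.
PREDICTED = {("ITC_A", 120): 3, ("ITC_B", 45): 1}
ALIGNMENTS = [
    DbSnpAlignment("snp1", "ITC_A", 0.90, 120),  # passes both thresholds
    DbSnpAlignment("snp2", "ITC_B", 0.80, 45),   # minor allele count too low
    DbSnpAlignment("snp3", "ITC_A", 0.50, 120),  # too little of the sequence aligned
    DbSnpAlignment("snp4", "ITC_C", 0.95, 10),   # no SNP predicted at that position
]

if __name__ == "__main__":
    print([a.snp_id for a in ALIGNMENTS if is_confirmed(a, PREDICTED)])
    print(percent_found(4244, 34508), percent_found(161, 645), percent_found(303, 645))
```

Passing `min_count=1` corresponds to relaxing the minor allele requirement. The `percent_found` helper reproduces the reported percentages from the reported counts (4,244 of 34,508; 161 and 303 of 645).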

