1
RAD (RNA Abundance Database)
Stoeckert, C.J.Jr., Pizarro, A., Manduchi, E., Gibson, M., Brunk, B., Crabtree, J., Schug, J., Shen-Orr, S., Overton, G.C. A relational schema for array and non-array based gene expression data. Bioinformatics. In press. 2001
The Computational Biology and Informatics Laboratory
2
Issues
Accurate experiment description
Data preprocessing issues: clean-up, calibration, normalization, other transformations
Selecting, interpreting, and comparing experiments appropriately requires knowing, in sufficient detail, how the experiments were performed and which samples were used, so that their quality and degree of similarity can be assessed.
The most appropriate criteria for spot selection, normalization, etc., depend on the experiments under study and on the questions investigated.
4
RAD
Multiple labs; multiple biological systems; multiple platforms; multiple image quantification software
RAD
Expressed genes; differentially-expressed genes; class discovery; class prediction; gene networks
5
RAD versatility
Platforms: 2-channel microarrays, filter arrays, Affymetrix, SAGE
Image quantification software: ScanAlyze, GEMTools, BioImage, …
6
Views
A “view” renames attributes of a low-level generic table for specific implementations. Fields common to all implementations keep the same attribute names everywhere, while implementation-specific fields rename generic attributes of the appropriate data type. These views are not the same as materialized views, which provide precalculated values to improve database query performance.
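As a rough illustration of this renaming, the sketch below builds one generic table in an in-memory SQLite database and defines two views over it. The view names and their columns follow the "SpotFamily views" slide; the generic column names (string1, int1), the subclass_view discriminator column, and the sample rows are assumptions made for the example, not RAD's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Generic low-level table. Column names other than spot_family_id and
-- ext_db_id are hypothetical placeholders for "generic attributes".
CREATE TABLE SpotFamilyImp (
    spot_family_id INTEGER PRIMARY KEY,
    ext_db_id      INTEGER,
    subclass_view  TEXT,      -- hypothetical: which view owns the row
    string1        TEXT,
    int1           INTEGER
);

-- Common fields keep their names; specific fields rename generic ones.
CREATE VIEW SAGESpotFamily AS
    SELECT spot_family_id, ext_db_id, string1 AS tag, int1 AS cluster_id
    FROM SpotFamilyImp
    WHERE subclass_view = 'SAGESpotFamily';

CREATE VIEW AffymetrixSpotFamily AS
    SELECT spot_family_id, ext_db_id, string1 AS accession
    FROM SpotFamilyImp
    WHERE subclass_view = 'AffymetrixSpotFamily';
""")

# Made-up rows, one per platform.
conn.executemany(
    "INSERT INTO SpotFamilyImp VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "SAGESpotFamily", "CATGAAAAAA", 42),
        (2, 11, "AffymetrixSpotFamily", "example_probe_set", None),
    ],
)

# Each view exposes the same stored column under a platform-specific name.
sage_rows = conn.execute("SELECT tag, cluster_id FROM SAGESpotFamily").fetchall()
affy_rows = conn.execute("SELECT accession FROM AffymetrixSpotFamily").fetchall()
print(sage_rows, affy_rows)
```

Both views read the same string1 column, but a SAGE user sees it as tag and an Affymetrix user sees it as accession, while spot_family_id and ext_db_id stay identical across views.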
7
A SpotResult View
8
RAD strengths
sample description: use of ontologies
information about array elements: links to GUS
storing of raw and processed data: captures all available information; history and parameter tracking
storing of public and proprietary data: user-group-other read/write permissions
The schema is compliant with the minimum annotations recommended by MGED.
Controlled vocabularies:
Taxonomy: the Taxon table uses the NCBI taxonomy obtained from GSDB in relational form.
Anatomy: the Anatomy table is modified from the stage 28 (adult) mouse anatomy from the Mouse Gene Expression Database (MGD) at Jackson Laboratory to include human terms and expand description of systems such as hematopoiesis, using medical textbooks as sources (e.g. the 37th edition of Gray’s Anatomy) as well as the expertise of the biologists at CBIL and in collaborating laboratories. The Anatomy hierarchy includes “cell lines” as a substructure for many different types of cells to distinguish immortalized from primary cells.
Disease: the Disease table uses the KEGG representation of the CDC ICD-9 classification. The KEGG representation has associated MIM identifiers with many of the ICD-9 terms.
9
Information to be captured
Figure from: David J. Duggan et al. (1999) Expression Profiling using cDNA microarrays. Nature Genetics 21: 10-14
10
Categories of tables
Experiment; Raw Data; Platform; Algorithm; Metadata; Processed Data
11
Experiment Tables (parts A and B)
Figure from: David J. Duggan et al. (1999) Expression Profiling using cDNA microarrays. Nature Genetics 21: 10-14
12
Experiment Tables (A)
Hybridization Conditions, Label, Sample, Treatment, Disease, Devel. Stage, ExperimentSample, Taxon, Anatomy, RelExperiments, Exp.ControlGenes, ControlGenes, Experiment, ExpGroups, Groups
13
Experiment Tables (B)
Experiment, ExpImageImp, ExpResultImp
Views of ExpImageImp: PhosphorImager, ScanAlyzeImage, GEMImage, StanfordScanner, AffymetrixScanner, SAGESequence, …
Views of ExpResultImp: BioImage, ScanAlyzeAnalysis, GEMResult, StanfordAnalysis, AffymetrixAnalysis, SAGEAnalysis, …
14
Platform Tables
Figure from: David J. Duggan et al. (1999) Expression Profiling using cDNA microarrays. Nature Genetics 21: 10-14
15
Platform Tables
SpotFamilyImp, SpotImp, Array
16
SpotFamily views (comparisons)
SAGESpotFamily: spot_family_id, tag, ext_db_id, cluster_id, …
GEMSpotFamily: spot_family_id, ext_db_id, source_id, plate_id, plate_row, plate_column, …
AffymetrixSpotFamily: spot_family_id, ext_db_id, accession, …
Each is a view of the SpotFamily table.
Link to data with spot_family_id.
Integrate through gene index.
GUS; EST assemblies; mRNA
17
Raw Data Tables
SpotImp, SpotResultImp, SpotFamilyImp, ExpResultImp, SpotFamilyResult
18
Processed Data/Algorithm Tables
SpotResultImp: raw spot value
SpotFamilyResult: summary of raw values
Algorithm: type of program used
AlgImplementation: actual program used
AlgoInvocation: usage of the algorithm
AlgParamKeyType: parameter data type
AlgParamKey: parameter description
AlgParam: value used
SpotResAnalysis: processed spot result
SpotFamResAnalysis: processed spot family result
AnalysisType: type of processing
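One way to picture how the algorithm tables could record which program and which parameter values produced a processed result is the sketch below. The class names mirror the table names on this slide, but the fields, the links between classes, and the example program and parameter ("example_normalizer", "trim_fraction") are assumptions for illustration only, not RAD's actual columns.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Algorithm:
    """Type of program used."""
    name: str


@dataclass
class AlgImplementation:
    """Actual program used to carry out an Algorithm."""
    algorithm: Algorithm
    program: str
    version: str


@dataclass
class AlgParamKey:
    """Parameter description; data_type stands in for AlgParamKeyType."""
    implementation: AlgImplementation
    name: str
    data_type: str


@dataclass
class AlgParam:
    """Value used for one parameter in one invocation."""
    key: AlgParamKey
    value: str


@dataclass
class AlgoInvocation:
    """One usage of the algorithm, with the parameter values it ran with."""
    implementation: AlgImplementation
    params: List[AlgParam] = field(default_factory=list)


def describe(invocation: AlgoInvocation) -> str:
    """Summarize an invocation as 'type via program version (key=value, ...)'."""
    impl = invocation.implementation
    settings = ", ".join(f"{p.key.name}={p.value}" for p in invocation.params)
    return f"{impl.algorithm.name} via {impl.program} {impl.version} ({settings})"


# Hypothetical example values, not taken from RAD.
normalization = Algorithm("normalization")
implementation = AlgImplementation(normalization, "example_normalizer", "1.0")
trim_key = AlgParamKey(implementation, "trim_fraction", "float")
run = AlgoInvocation(implementation, [AlgParam(trim_key, "0.05")])
summary = describe(run)
print(summary)
```

Separating the parameter description from the value used means that a processed spot result only needs to point at one invocation to recover the full processing history.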
19
Query RAD by Sample or by Experiment
Access by Experiment groups Sample info ontologies Image info
22
What genes are expressed in the top 20% of normal B-lymphocytes and mapped to Chromosome 19?
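A question like this combines an expression result with a gene annotation. The sketch below shows one possible reading, in which "top 20%" means the highest-ranked 20% of expression values measured in the sample. The two toy tables, their columns, the function name, and all data are hypothetical; in RAD the annotation would come through the link to the gene index rather than a local table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified tables for illustration only.
CREATE TABLE result (spot_family_id INTEGER, sample TEXT, value REAL);
CREATE TABLE annotation (spot_family_id INTEGER, gene TEXT, chromosome TEXT);
""")

# Ten made-up measurements with values 1.0 .. 10.0 for one sample.
conn.executemany(
    "INSERT INTO result VALUES (?, ?, ?)",
    [(i, "normal B-lymphocyte", float(i)) for i in range(1, 11)],
)
# Made-up annotation: even ids map to chromosome 19, odd ids to chromosome 1.
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?)",
    [(i, f"gene{i}", "19" if i % 2 == 0 else "1") for i in range(1, 11)],
)


def top_percent_on_chromosome(conn, sample, chromosome, percent=20):
    """Genes in the top `percent` of values for `sample` that map to `chromosome`."""
    n = conn.execute(
        "SELECT COUNT(*) FROM result WHERE sample = ?", (sample,)
    ).fetchone()[0]
    k = (n * percent + 99) // 100  # integer ceiling of n * percent / 100
    rows = conn.execute(
        """
        SELECT a.gene
        FROM (SELECT spot_family_id, value
              FROM result
              WHERE sample = ?
              ORDER BY value DESC
              LIMIT ?) AS ranked
        JOIN annotation AS a ON a.spot_family_id = ranked.spot_family_id
        WHERE a.chromosome = ?
        ORDER BY ranked.value DESC
        """,
        (sample, k, chromosome),
    ).fetchall()
    return [row[0] for row in rows]


genes = top_percent_on_chromosome(conn, "normal B-lymphocyte", "19")
print(genes)
```

With ten values, the top 20% is the two highest (ids 10 and 9), and only id 10 maps to chromosome 19 in this toy data, so the result is a single gene.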
23
The allgenes (GUS) index provides annotation of array elements in RAD
EST clustering and assembly
Different representations of the same RNA are identified.
EST/mRNA annotations are combined.
Consensus sequence is annotated (e.g., gene function).
24
GUS: Genomics Unified Schema
[Diagram of GUS components. Labels: Ontologies (GO, Species, Tissue, Dev. Stage); Genomic Sequence (genes, gene models; STSs, repeats, etc.; cross-species analysis); DOTS Transcribed Sequence (characterize transcripts; RH mapping; library analysis; cross-species analysis); RAD RNA Abundance DB, Transcript Expression (arrays, SAGE, conditions); Special Features (ownership, protection, algorithm, evidence, similarity, versioning); "under development"; Protein Sequence (domains, function, structure, cross-species analysis); Pathways, Networks (representation, reconstruction).]
25
Different Views of RAD
Focused annotation of specific organisms and biological systems:
organisms: Human, Mouse, Plasmodium falciparum
biological systems: Endocrine pancreas, CNS, Hematopoiesis
*not drawn to scale*
28
Continuing Work and Future Issues
Analysis perspective: ontologies; data preprocessing; cross-platform comparisons; utilize other types of high-throughput data (e.g. protein expression)
DB perspective: capture conclusions from analyses in a structured way; integrate other types of high-throughput data
29
RAD: www.cbil.upenn.edu/RAD2
Elisabetta Manduchi; Angel Pizarro; Shannon McWeeney
Allgenes: Brian Brunk; Ed Uberbacher, ORNL; Jonathan Crabtree; Doug Hyatt, ORNL; Sharon Diskin; Joan Mazzarelli; Jonathan Schug
EPConDB: Greg Grant; Klaus Kaestner, Penn; Phillip Le; Marie Scearce, Penn; Debbie Pinney; Doug Melton, Harvard; Alan Permutt, Wash U
MGED: