Scientific Data Management Why is it so hard? Nat Goodman Institute for Systems Biology September 2003.


1 Scientific Data Management: Why is it so hard?
Nat Goodman, Institute for Systems Biology, September 2003

2 What is scientific data management? (BRIITE Presentation, Sep 2003, Nat Goodman)
What is science?
What is scientific data?
  Content areas
  Bench/cage/bedside to publication and back
  Small vs. big science
What do we want to manage?
  Store/fetch
  Analysis
  Sharing

3 Scientific data sub-problems
Basic vs. clinical
For basic:
  Columns: Pre-publication / Public / Post-publication
  Small science: ??? / Literature, Databases / Access to databases, Data integration
  Big science: LIMS, Analysis pipelines

4 Concrete is cheap - abstract is dear
[Figure: examples placed along an axis from Concrete to Abstract]
  Has contact info on computer
  Has email address in Outlook Address Book
  One sequence from clone (IMAGE:30346642)
  Sequence of human gene (neurexin-3)
  Sequence of gene (neurexins)
  Friends

5 Human genes over time
1) name (HD), cytogenetic position (4p16.3), alleles
2) name, cytogenetic position, mRNA sequence, coding region (protein sequence)
3) name, cytogenetic position, splice forms
   splice form = mRNA sequence, coding region
4) name, cytogenetic position, genomic coordinates, exons, splice forms
   exon = genomic stretch
   splice form = (list of included exons (mRNA sequence) OR mRNA sequence), coding region
   included exon = all or part of exon
5) name, genomic coordinates, exons, splice forms, alleles, gene families (paralogs), orthologs
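The stage 4 model in the list above could be sketched roughly as follows. This is only an illustration, not the speaker's design: the class names, the 0-based coordinates, and the plus-strand-only splicing are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Exon:
    """exon = genomic stretch (0-based start, assumed plus strand)."""
    start: int
    length: int


@dataclass(frozen=True)
class IncludedExon:
    """included exon = all or part of an exon."""
    exon: Exon
    offset: int = 0               # where the included part begins inside the exon
    length: Optional[int] = None  # None means "through the end of the exon"

    def span(self) -> Tuple[int, int]:
        """Return the (start, end) genomic interval that is included."""
        if self.length is None:
            length = self.exon.length - self.offset
        else:
            length = self.length
        start = self.exon.start + self.offset
        return (start, start + length)


@dataclass
class SpliceForm:
    """splice form = (list of included exons OR mRNA sequence), coding region."""
    name: str
    coding_start: int
    coding_length: int
    included_exons: Optional[List[IncludedExon]] = None
    mrna_sequence: Optional[str] = None

    def mrna(self, genomic: str) -> str:
        """Derive the mRNA from exons if present, else use the stored sequence."""
        if self.included_exons is not None:
            spans = (ie.span() for ie in self.included_exons)
            return "".join(genomic[a:b] for a, b in spans)
        if self.mrna_sequence is not None:
            return self.mrna_sequence
        raise ValueError("splice form has neither exons nor a sequence")
```

Even this small sketch needs an either/or field and a derived value, which is the kind of structure the later slides argue is awkward to express in flat tables.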

6 Complications
Granularity issues
  Some data abstract: neurexins nucleate assembly of cytoplasmic scaffold [Dean C et al. Nat Neurosci. 2003 Jul;6(7):708-16]
  Some data specific (hypothetical example): mutation in specific base of human neurexin-3 affects specific splice form
Attribution and evidence
Unstable data
  Conflicting data common
  Errors common
    – Big databases notoriously error-prone
    – No one has ever calibrated literature – may be just as bad!
  Abstract data evolves
  Versioning, esp. for data derived from public databases
What good is all this unless application programs can exploit it?

7 Database of all genes
Gene concept evolving at different rates in different organisms
Database of all genes must transcend these differences
1) Least common denominator
2) Union of complexity
3) Something fancy – variable complexity – abstraction!
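One way to picture the three options is the sketch below. Everything in it is hypothetical: the record layout, the second gene ("geneB"), and the class and function names are invented for illustration and do not come from the talk.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional

# Hypothetical records; the attribute sets differ by organism.
RECORDS: List[Dict[str, object]] = [
    {"name": "HD", "cyto_position": "4p16.3"},
    {"name": "geneB", "exons": [(0, 3), (6, 3)]},
]


def least_common_denominator(records):
    """Option 1: keep only the attributes every record has."""
    shared = set.intersection(*(set(r) for r in records))
    return [{k: r[k] for k in sorted(shared)} for r in records]


def union_of_complexity(records):
    """Option 2: keep every attribute seen anywhere; gaps become None."""
    every = set.union(*(set(r) for r in records))
    return [{k: r.get(k) for k in sorted(every)} for r in records]


class Gene(ABC):
    """Option 3: one abstract interface, several levels of detail."""

    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def exon_count(self) -> Optional[int]:
        """Number of exons, or None if this model does not record exons."""


class MappedGene(Gene):
    """A gene known only by name and cytogenetic position."""

    def __init__(self, name: str, cyto_position: str):
        super().__init__(name)
        self.cyto_position = cyto_position

    def exon_count(self) -> Optional[int]:
        return None


class ExonGene(Gene):
    """A gene with an exon list."""

    def __init__(self, name: str, exons: list):
        super().__init__(name)
        self.exons = exons

    def exon_count(self) -> Optional[int]:
        return len(self.exons)
```

Option 1 throws information away and option 2 fills the table with empty values. Option 3 keeps both levels of detail, but every consumer has to handle the "not recorded" answer.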

8 Finding the sweet spot
[Figure: axes "Software abstraction" and "Scientific abstraction"; labeled points: "Custom database for one laboratory project", "Generic database for all of science", "Generic LIMS for molecular projects", "Interaction database", "Gene database"]

9 Data models
Data model can mean
1) Design of specific database or data type
   e.g., GenBank sequence data model
2) Computer language or system for encoding designs
   e.g., SQL data model, ORACLE data model
3) Formalism for expressing data designs
   e.g., relational data model, object data model
Different data models (in sense #2 or #3) have different expressive power

10 Data model history
Relational model (Ted Codd, circa 1970)
  Demolished all predecessors – start of modern era
Entity-relationship, functional, many others
  Emerged and competed briefly in 80s
Object models
  Emerged and competed briefly in 90s
  Object-relational persists (behind closed doors)
Meanwhile, in AI and knowledge representation
  Semantic data models – much more powerful
  Never accepted by database folks

11 Relational data model is wimpy
Flat tables very limiting
Current gene data model
  name, cytogenetic position, genomic coordinates, exons, splice forms
  needs at least 5 tables
    Gene (id, name, cyto_position, chrom, start, length, strand)
    Exon (id, start, length)
    SpliceForm (id, name, gene, coding_start, coding_length)
    SpliceForm_type1 (spliceform, sequence)
    SpliceForm_type2 (spliceform, ordinal, exon, start, length)
  plus stored procedures or application code to splice exons and translate mRNA into protein
Database of all genes needs several such models plus something that glues them together
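As a rough illustration of the five tables plus application code, the sketch below builds them in an in-memory SQLite database and splices exons in Python. The table and column names come from the slide. The column types, the reading of `SpliceForm_type2.start` as an offset inside the exon, the use of `Exon.start` as an index into a genomic string, and the toy data are all assumptions. Translation into protein is left out.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Gene (id INTEGER PRIMARY KEY, name TEXT, cyto_position TEXT,
                   chrom TEXT, start INTEGER, length INTEGER, strand TEXT);
CREATE TABLE Exon (id INTEGER PRIMARY KEY, start INTEGER, length INTEGER);
CREATE TABLE SpliceForm (id INTEGER PRIMARY KEY, name TEXT,
                         gene INTEGER REFERENCES Gene(id),
                         coding_start INTEGER, coding_length INTEGER);
CREATE TABLE SpliceForm_type1 (spliceform INTEGER REFERENCES SpliceForm(id),
                               sequence TEXT);
CREATE TABLE SpliceForm_type2 (spliceform INTEGER REFERENCES SpliceForm(id),
                               ordinal INTEGER, exon INTEGER REFERENCES Exon(id),
                               start INTEGER, length INTEGER);
"""

GENOMIC = "AAACCCGGGTTT"  # toy genomic sequence, invented for the example


def splice(conn, spliceform_id, genomic):
    """Return the mRNA for a splice form, from whichever table holds it."""
    row = conn.execute(
        "SELECT sequence FROM SpliceForm_type1 WHERE spliceform = ?",
        (spliceform_id,),
    ).fetchone()
    if row is not None:
        return row[0]  # type 1: the sequence is stored directly
    parts = conn.execute(
        "SELECT e.start + t.start, t.length "
        "FROM SpliceForm_type2 t JOIN Exon e ON e.id = t.exon "
        "WHERE t.spliceform = ? ORDER BY t.ordinal",
        (spliceform_id,),
    ).fetchall()
    # type 2: concatenate the included part of each exon, in order
    return "".join(genomic[s:s + n] for s, n in parts)


def build_demo():
    """Create the schema and load two toy splice forms of one toy gene."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Gene VALUES (1, 'toy', NULL, 'chr0', 0, 12, '+')")
    conn.executemany("INSERT INTO Exon VALUES (?, ?, ?)", [(1, 0, 3), (2, 6, 6)])
    conn.executemany("INSERT INTO SpliceForm VALUES (?, ?, 1, 0, 6)",
                     [(1, "by-exons"), (2, "by-sequence")])
    conn.executemany("INSERT INTO SpliceForm_type2 VALUES (?, ?, ?, ?, ?)",
                     [(1, 1, 1, 0, 3), (1, 2, 2, 3, 3)])
    conn.execute("INSERT INTO SpliceForm_type1 VALUES (2, 'AAAGGG')")
    return conn
```

Calling `splice(build_demo(), 1, GENOMIC)` joins two tables and then finishes the job outside the database, which is the division of labor the slide complains about.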

12 Better data models better databases
[Figure: axes "Software abstraction" and "Scientific abstraction"; labels: "Custom database for one laboratory project", "Generic database for all of science", "sweet spot", "More powerful data models"]

13 Therapeutic agents in mouse models
Agents: drugs, lifestyle, environment
Mouse models
Study design
  Treatment arms
    – agent, formulation, route of administration, dose, schedule
  Measures
    – protocol, schedule, reporting
  Endpoints
Study itself, i.e., one running of the design
  How many animals per arm
  Deviations from design
  Results
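The design/study split listed above might be modeled along these lines. This is a hedged sketch: the field types, keying animal counts by agent name, and the `comparable_with` rule (same mouse model and same measurement protocols) are assumptions added for illustration, not something the slides specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class TreatmentArm:
    agent: str
    formulation: str
    route: str      # route of administration
    dose: str
    schedule: str


@dataclass(frozen=True)
class Measure:
    protocol: str
    schedule: str
    reporting: str


@dataclass
class StudyDesign:
    mouse_model: str
    arms: List[TreatmentArm]
    measures: List[Measure]
    endpoints: List[str]


@dataclass
class Study:
    """One running of a design."""
    design: StudyDesign
    animals_per_arm: Dict[str, int]  # keyed by agent name (assumption)
    deviations: List[str] = field(default_factory=list)
    results: Dict[str, object] = field(default_factory=dict)

    def total_animals(self) -> int:
        """Total animals across all arms."""
        return sum(self.animals_per_arm.values())

    def comparable_with(self, other: "Study") -> bool:
        """Assumed rule: same mouse model and same set of measure protocols."""
        mine = {m.protocol for m in self.design.measures}
        theirs = {m.protocol for m in other.design.measures}
        return self.design.mouse_model == other.design.mouse_model and mine == theirs
```

A check like `comparable_with` can only flag a mismatch. It cannot make results from different mouse models or protocols comparable after the fact.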

14 Does it help?
Data from Schilling G et al. Neurosci Lett 2001 Nov 27;315(3):149-53 and Ferrante RJ et al. J Neurosci 2002 Mar 1;22(5):1592-9
[Figure: panels labeled R6/2 and N171-82Q]
Fancy database can't compensate for incomparable data

15 Why is it so hard?
Vast field
Complex, abstract, conflicting, evolving, incomparable data
Relational data model barely copes
Smarter data models may be futile unless application programs get smarter, too

