Presentation is loading. Please wait.

Presentation is loading. Please wait.

NeSC, 25 April 2002 Why Dont Scientists Use Databases? 1 Why Dont Scientists Use Databases? Peter Buneman Division of Informatics University of Edinburgh.

Similar presentations


Presentation on theme: "NeSC, 25 April 2002 Why Dont Scientists Use Databases? 1 Why Dont Scientists Use Databases? Peter Buneman Division of Informatics University of Edinburgh."— Presentation transcript:

1 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 1 Why Dont Scientists Use Databases? Peter Buneman Division of Informatics University of Edinburgh Digital Libraries grant IIS 98-17444 (NSF,DARPA,NLM, LoC,NEH, NASA) http://db.cis.upenn.edu http://db.cis.upenn.edu/Research/provenance.html

2 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 2 Why Dont Scientists Use Databases Much? Much? Relational Thanks to: The ontologists and astronomers at Edinburgh The database and bio-informatics groups at Penn Aleri Inc. Special thanks (material stolen from) Sanjeev Khanna, Wang-Chiew, Keishi Tajima, Susan Davidson, Fidel Salas

3 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 3 Scientific Data is Ubiquitous 500 or so public molecular biology databases. –much discovery in silico Vast amounts of satellite imagery –maintaining it is very expensive Terabytes of astronomical data (not image data) Linguistic corpora are essential research tools -- also in terabytes

4 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 4 Relational DBs -- tabular data Id Name Address 123 H. SimpsonSpringfield 456L. SimpsonSpringfield 321A. JonesLondon IdCourseGrade 456GeometryA 123AlgorithmsD 456VoltaireA 321GeometryB 321AlgorithmsC TitleDeptTeacher AlgorithmsCompSciDr. Deadhead Voltaire FrenchProf. lePew GeometryMathDr. Obtuse Useful information is obtained by combining tables. Efficient algorithms for – comining and indexing tables – transaction processing (updates and multiple users) Relational databases are a multi giga-$ industry

5 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 5 Reasons for Mismatch Scientific data sets are too large (image data, huge analyses) Scientific data is too complex Relational databases dont work well with arrays and scientific computation Schema evolution and history are important Databases are too expensive

6 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 6 Swissprot -- a curated database (1 of ~100,000 entries) ID 11SB_CUCMA STANDARD; PRT; 480 AA. AC P13744; DT 01-JAN-1990 (REL. 13, CREATED) DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE) DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE) DE 11S GLOBULIN BETA SUBUNIT PRECURSOR. OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH). OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE; OC VIOLALES; CUCURBITACEAE. RN [1] RP SEQUENCE FROM N.A. RC STRAIN=CV. KUROKAWA AMAKURI NANKIN; RX MEDLINE; 88166744. RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.; RL EUR. J. BIOCHEM. 172:627-632(1988). RN [2] RP SEQUENCE OF 22-30 AND 297-302. RA OHMIYA M., HARA I., MASTUBARA H.; RL PLANT CELL PHYSIOL. 21:157-167(1980). CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN. CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A CC DISULFIDE BOND. CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS). DR EMBL; M36407; G167492; -. DR PIR; S00366; FWPU1B. DR PROSITE; PS00305; 11S_SEED_STORAGE; 1. KW SEED STORAGE PROTEIN; SIGNAL. FT SIGNAL 1 21 FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT. FT CHAIN 22 296 GAMMA CHAIN (ACIDIC). FT CHAIN 297 480 DELTA CHAIN (BASIC). FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID. FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL). FT CONFLICT 27 27 S -> E (IN REF. 2). FT CONFLICT 30 30 E -> S (IN REF. 2). SQ SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32; MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE //... OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH). OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE; OC VIOLALES; CUCURBITACEAE. RN [1] RP SEQUENCE FROM N.A. RC STRAIN=CV. KUROKAWA AMAKURI NANKIN; RX MEDLINE; 88166744. RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.; RL EUR. J. BIOCHEM. 172:627-632(1988).... MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE... Data Metadata ???

7 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 7 ID 11SB_CUCMA STANDARD; PRT; 480 AA. AC P13744; DT 01-JAN-1990 (REL. 13, CREATED) DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE) DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE) DE 11S GLOBULIN BETA SUBUNIT PRECURSOR. OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH). OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE; OC VIOLALES; CUCURBITACEAE. RN [1] RP SEQUENCE FROM N.A. RC STRAIN=CV. KUROKAWA AMAKURI NANKIN; RX MEDLINE; 88166744. RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.; RL EUR. J. BIOCHEM. 172:627-632(1988). RN [2] RP SEQUENCE OF 22-30 AND 297-302. RA OHMIYA M., HARA I., MASTUBARA H.; RL PLANT CELL PHYSIOL. 21:157-167(1980). CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN. CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A CC DISULFIDE BOND. CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS). DR EMBL; M36407; G167492; -. DR PIR; S00366; FWPU1B. DR PROSITE; PS00305; 11S_SEED_STORAGE; 1. KW SEED STORAGE PROTEIN; SIGNAL. FT SIGNAL 1 21 FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT. FT CHAIN 22 296 GAMMA CHAIN (ACIDIC). FT CHAIN 297 480 DELTA CHAIN (BASIC). FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID. FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL). FT CONFLICT 27 27 S -> E (IN REF. 2). FT CONFLICT 30 30 E -> S (IN REF. 2). SQ SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32; MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE // RN [1] RP SEQUENCE FROM N.A. RC STRAIN=CV. KUROKAWA AMAKURI NANKIN; RX MEDLINE; 88166744. RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.; RL EUR. J. BIOCHEM. 172:627-632(1988). RN [2] RP SEQUENCE OF 22-30 AND 297-302. RA OHMIYA M., HARA I., MASTUBARA H.; RL PLANT CELL PHYSIOL. 21:157-167(1980). Hierarchical data. Order important. DT 01-JAN-1990 (REL. 13, CREATED) DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE) DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE) Record (inadequate) of history

8 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 8 ID 11SB_CUCMA STANDARD; PRT; 480 AA. AC P13744; DT 01-JAN-1990 (REL. 13, CREATED) DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE) DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE) DE 11S GLOBULIN BETA SUBUNIT PRECURSOR. OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH). OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE; OC VIOLALES; CUCURBITACEAE. RN [1] RP SEQUENCE FROM N.A. RC STRAIN=CV. KUROKAWA AMAKURI NANKIN; RX MEDLINE; 88166744. RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.; RL EUR. J. BIOCHEM. 172:627-632(1988). RN [2] RP SEQUENCE OF 22-30 AND 297-302. RA OHMIYA M., HARA I., MASTUBARA H.; RL PLANT CELL PHYSIOL. 21:157-167(1980). CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN. CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A CC DISULFIDE BOND. CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS). DR EMBL; M36407; G167492; -. DR PIR; S00366; FWPU1B. DR PROSITE; PS00305; 11S_SEED_STORAGE; 1. KW SEED STORAGE PROTEIN; SIGNAL. FT SIGNAL 1 21 FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT. FT CHAIN 22 296 GAMMA CHAIN (ACIDIC). FT CHAIN 297 480 DELTA CHAIN (BASIC). FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID. FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL). FT CONFLICT 27 27 S -> E (IN REF. 2). FT CONFLICT 30 30 E -> S (IN REF. 2). SQ SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32; MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE // FT SIGNAL 1 21 FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT. FT CHAIN 22 296 GAMMA CHAIN (ACIDIC). FT CHAIN 297 480 DELTA CHAIN (BASIC). FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID. FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL). FT CONFLICT 27 27 S -> E (IN REF. 2). FT CONFLICT 30 30 E -> S (IN REF. 2). Array indices (array operations?) OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE; OC VIOLALES; CUCURBITACEAE. Tree data (recursive query processing?)

9 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 9 ID 11SB_CUCMA STANDARD; PRT; 480 AA. AC P13744; DT 01-JAN-1990 (REL. 13, CREATED) DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE) DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE) DE 11S GLOBULIN BETA SUBUNIT PRECURSOR. OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH). OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE; OC VIOLALES; CUCURBITACEAE. RN [1] RP SEQUENCE FROM N.A. RC STRAIN=CV. KUROKAWA AMAKURI NANKIN; RX MEDLINE; 88166744. RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.; RL EUR. J. BIOCHEM. 172:627-632(1988). RN [2] RP SEQUENCE OF 22-30 AND 297-302. RA OHMIYA M., HARA I., MASTUBARA H.; RL PLANT CELL PHYSIOL. 21:157-167(1980). CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN. CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A CC DISULFIDE BOND. CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS). DR EMBL; M36407; G167492; -. DR PIR; S00366; FWPU1B. DR PROSITE; PS00305; 11S_SEED_STORAGE; 1. KW SEED STORAGE PROTEIN; SIGNAL. FT SIGNAL 1 21 FT CHAIN 22 480 11S GLOBULIN BETA SUBUNIT. FT CHAIN 22 296 GAMMA CHAIN (ACIDIC). FT CHAIN 297 480 DELTA CHAIN (BASIC). FT MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID. FT DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL). FT CONFLICT 27 27 S -> E (IN REF. 2). FT CONFLICT 30 30 E -> S (IN REF. 2). SQ SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32; MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE // CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN. CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A CC DISULFIDE BOND. CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS). Structure in comments = schema evolution

10 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 10 To turn Swissprot into tables requires: 20 - 30 tables –nothing extraordinary by relational standards, but –huge query to reconstruct original form Invented keys Queries on order and arrays Recursive query processing Also need to deal with schema evolution

11 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 11 Curated Databases Useful scientific databases are often curated : they are created/ maintained with a great deal of manual labour. select xyz from pqr where abc Database peoples idea of what happens What really happens DB1 DB2

12 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 12 Database Inter-dependence is Complex GERD TRRD GenBank Swissprot EpoDB TransFac GAIA BEAD

13 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 13 Three new topics Annotation –how do I annotate a data element, and how is this passed through queries? Archiving –how do we keep all the old versions of a database? Vertical partitioning. –combining databases and vector processing.

14 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 14 Data Annotation (Khanna, Tan) Some databases (e.g. biology and linguistics) are designed to accommodate annotations Also a need for ad hoc (unanticipated) annotations. –How are annotations communicated? –How are they passed through queries? No general techniques or principles.

15 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 15 Restaurant CostType Peacock Alley Bull & Bear Pacifica Soho Kitchen & Bar $$$French $$$Seafood $Chinese $ American Restaurant CostType Pacifica Soho Kitchen & Bar $Chinese $ American All Restaurants (View 1) Cheap Restaurants (View 2) Yummy chicken curry!! NYRestaurants (Source Table) Restaurant CostType Peacock Alley Bull & Bear Pacifica Soho Kitchen & Bar Zip $$$French10022 $$$Seafood10022 $Chinese10013 $ American10022 Serves fine French Cuisine in elegant setting. Jackets required. Extensive wine list! Sharing annotations (courtesy of Wang-Chiew Tan)

16 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 16 Annotation looks simple but... Computing how an annotation should move through a query is intractable Equivalent queries may not carry annotations in the same way New insights are needed!

17 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 17 How do we Build Archival Databases? [Khanna, Tajima, Tan] Many scientific database keep archives. Its important to preserve the state of knowledge as it was in the past Archive frequently: space consuming Archive infrequently: delay in getting recent information published.

18 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 18 The dangers of electronic documents http://www.ornl.gov/hgmis/publicat/miscpubs/bioinfo/inf_rep2.html#AppenI APPENDIX I: SAMPLE QUESTIONS FOR A FEDERATED DATABASE Continued HGP progress will depend in part upon the ability of genome databases to answer increasingly complex queries that span multiple community databases. Some examples of such queries are given in this appendix. Note, however, until a fully atomized sequence database is available (i.e., no data stored in ASCII text fields), none of the queries in this appendix can be answered.... APPENDIX I: SAMPLE QUESTIONS FOR A FEDERATED DATABASE Continued HGP progress will depend in part upon the ability of genome databases to answer increasingly complex queries that span multiple community databases. Some examples of such queries are given in this appendix. Note, however, until a fully atomized sequence database is available (i.e., no data stored in ASCII text fields), none of the queries in this appendix can be answered.... APPENDIX I: SAMPLE QUESTIONS FOR A FEDERATED DATABASE Continued HGP progress will depend in part upon the ability of genome databases to answer increasingly complex queries that span multiple community databases. Some examples of such queries are given in this appendix. Note, however, until a fully relationalized sequence database is available, none of the queries in this appendix can be answered.... APPENDIX I: SAMPLE QUESTIONS FOR A FEDERATED DATABASE Continued HGP progress will depend in part upon the ability of genome databases to answer increasingly complex queries that span multiple community databases. Some examples of such queries are given in this appendix. Note, however, until a fully relationalized sequence database is available, none of the queries in this appendix can be answered.... Now: Then: Report of a DOE bioinformatics summit ca. 1994 (No archive/edition! No footnote!)

19 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 19 Examples from Bioinformatics Swissprot. New version produced every four months. –Old versions are kept. –Difficult to get at most recent data OMIM. New version produced every day –Old versions are not kept –Impossible to reconstruct past states of the data

20 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 20 Current approaches use diff need to preserve object continuity through time Version 1: Joe March South Street 12345 Jane May Pine Street 67890 Output of line diff (versions 1-2): 3,4c Jane May 9,10c Joe March Version 2: Jane May South Street 12345 Joe March Pine Street 67890 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Line Number

21 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 21 A Sequence of Versions

22 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 22 Pushing time down [Driscoll, Sarnak, Sleator, Tarjan: Making Data Structures Persistent. ]

23 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 23 Experimental Results (OMIM) Number of versions Size (bytes) x 10 6 XMill(archive) gzip(inc diff) version archive, inc diff Legend archive inc diff version compressed inc diff compressed archive Uncompressed Archive size is – 1.01 times diff repository size – 1.04 times size of largest version Compressed archive size is between 0.94 and 1 times compressed diff repository size gzip - unix compression tool XMill - XML compression tool

24 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 24 Experimental Results (Swissprot) Uncompressed Archive size is – 1.08 times diff repository size – 1.92 times size of largest version Compressed archive size is between 0.59 and 1 times compressed diff repository size gzip - unix compression tool XMill - XML compression tool Number of versions Size (bytes) x 10 6 archive XMill(archive) version inc diff gzip(inc diff) Legend archive inc diff version compressed inc diff compressed archive

25 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 25 The Bottom Line We have built an archiver, using XML as the base format We can build a year of archives (archive as often as you like) for a 14% increase on the size of the most recent database Based on keys -- preserves object history Works well with compression Obtaining an old archive is no more expensive than getting the current version.

26 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 26 Vertical partitioning An old idea revisited Fusion of array processing languages and database query languages Substantial use on Wall Street!!!

27 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 27 Conventional Storage Rows are stored contiguously. Order is not preserved (Horizontal partitioning) disk pages

28 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 28 Problems with conventional storage Unanticipated queries will probably read the whole database SELECT average sqrt(shoe-size) FROM employee WHERE hat-size > shoe-size (this only needs two fields) Order or rows is random and does not support order-sensitive functions: moving window averages, convolutions, etc.

29 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 29 Vertical partitioning (vectorisation) Columns are stored contiguously. Order is preserved (Vertical partitioning) disk pages

30 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 30 Advantages of Vertical Patitioning Faster queries. –A query that reads 2 columns in 100 does 2% of the i/o (i/o cost dominates) –A few columns can often reside in memory. Computation on order Can use both SQL and vector processing languages Downside: deletions are horribly expensive. –but deletions are uncommon in scientific DBs Vertical partitioning can also be performed on hierarchical structures -- like Swissprot -- and XML

31 NeSC, 25 April 2002 Why Dont Scientists Use Databases? 31 Many other issues Heterogeneous data integration – a perennial problem –can it be done by the end-users? Distributed query evaluation against redundant, constrained data. Data provenance Data streams and many more All these involve hard, fundamental problems in Computer Science


Download ppt "NeSC, 25 April 2002 Why Dont Scientists Use Databases? 1 Why Dont Scientists Use Databases? Peter Buneman Division of Informatics University of Edinburgh."

Similar presentations


Ads by Google