Curated Databases Peter Buneman School of Informatics University of Edinburgh.


1 Curated Databases. Peter Buneman, School of Informatics, University of Edinburgh

2 The Population of Corfu (2001) 107,879 (as of 2001) *** 93, , , , , ,043 approximately 110,000 approximately ,200 (2003 est) around 110, htm about 110,00 97,102 in , ,512 about 100, ,000 approximately 107,000 *** The only site to give attribution: ECDL2

3 These are both curated databases

4 What is a curated database? A curated database is one that is maintained with a lot of human effort. Curare: Latin “to care for”. Prime concern is quality of data.

5 What is a database? (for the purposes of this talk) Any structured collection of data that is subject to change/revision: – Ontologies – XML and other structured text files – Structured wikis – Standard relational and object-oriented databases

6 Curated databases have interesting properties… A digital reference work: traditional dictionaries, gazetteers and encyclopedias have been replaced by curated databases. Value lies in the organization and annotation of data. Commonly constructed by copying parts of other (curated) databases. Rapidly increasing in scientific research (> 1000 in molecular biology). Constantly checked/verified: data quality and timeliness are important. Often group efforts, produced by a dedicated organization or collaboration. Increasingly seen as “publications” by scientists. (You get kudos if someone uses your database, like a citation.)

7 ... and they are very expensive. [Figure: scale of cost in $/€/£ per byte, listing big physics (LHC) data, [movie], book, “production” code / curated data, and “reliable” code / curated data; surviving tick labels: 1, 10.]

8 A change for the better?
20th-century libraries: storage is redundant, persistent and distributed; readable by people; clear standards for citation; historical record (old data is useful); well-understood ownership/IP.
Curated databases: storage is single-source, volatile and centralised; internal DBMS format; no standards for citation; no historical record; mind-boggling legal issues.
20th-century libraries did some things better!

9 Some computer science issues: archiving (CS usage), provenance, annotation/citation, data cleaning. All of these are intimately connected. For example, if you cite some part of a curated database, the version you cited should be available (archiving).

10 Some well-known curated databases: CIA World Factbook, UniProt
ID 11SB_CUCMA STANDARD; PRT; 480 AA.
AC P13744;
DT 01-JAN-1990 (REL. 13, CREATED)
DT 01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
DT 01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.
OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC VIOLALES; CUCURBITACEAE.
RN [1]
RP SEQUENCE FROM N.A.
RC STRAIN=CV. KUROKAWA AMAKURI NANKIN;
RX MEDLINE;
RA HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
RL EUR. J. BIOCHEM. 172: (1988).
RN [2]
RP SEQUENCE OF ND
RA OHMIYA M., HARA I., MASTUBARA H.;
RL PLANT CELL PHYSIOL. 21: (1980).
CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
CC DISULFIDE BOND.
CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
DR EMBL; M36407; G167492; -.
DR PIR; S00366; FWPU1B.
DR PROSITE; PS00305; 11S_SEED_STORAGE; 1.
KW SEED STORAGE PROTEIN; SIGNAL.
FT SIGNAL 1 21
FT CHAIN S GLOBULIN BETA SUBUNIT.
FT CHAIN GAMMA CHAIN (ACIDIC).
FT CHAIN DELTA CHAIN (BASIC).
FT MOD_RES PYRROLIDONE CARBOXYLIC ACID.
FT DISULFID INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
FT CONFLICT S -> E (IN REF. 2).
FT CONFLICT E -> S (IN REF. 2).
SQ SEQUENCE 480 AA; MW; D515DD6E CRC32;
MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV
FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE
EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE
TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY
TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF
KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE
//

11 Archiving / Database Preservation. How do we preserve something that evolves (both in content and structure)? Keep snapshots? – frequent: space consuming – infrequent: lose “history”. Most curated databases have a hierarchical structure that we can exploit…

12 A Sequence of Versions

13 Pushing time down. This relies on a deterministic / keyed model: there’s a unique path to every data item. [B., Khanna, Tajima, Tan, TODS 27,2 (2004)]
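The idea of merging versions of a keyed tree into one archive can be sketched as follows. This is a minimal illustration under stated assumptions, not the archiver described in the talk: the node layout (`times`, `children`, `values`) and the function names `merge_version` and `push_time_down` are hypothetical, versions are plain nested dicts, and a timestamp of `None` means "same as the parent". The two population figures come from the Factbook example later in the talk.

```python
def merge_version(archive, version, t):
    """Merge one version (nested dict keyed by element name) into the archive,
    recording time t on every node and leaf value that is present."""
    for key, value in version.items():
        node = archive.setdefault(key, {"times": set(), "children": {}, "values": {}})
        node["times"].add(t)
        if isinstance(value, dict):
            merge_version(node["children"], value, t)
        else:
            node["values"].setdefault(value, set()).add(t)


def push_time_down(nodes, inherited):
    """Replace every timestamp set that equals its parent's with None,
    so that a timestamp is stored only where it differs from the parent."""
    for node in nodes.values():
        own = node["times"]
        if own == inherited:
            node["times"] = None
        for value, stamps in list(node["values"].items()):
            if stamps == own:
                node["values"][value] = None
        push_time_down(node["children"], own)


# Two hypothetical versions; the unique path to each item is its key path.
v2002 = {"China": {"Population": "1,284,303,705", "Capital": "Beijing"}}
v2003 = {"China": {"Population": "1,286,975,468", "Capital": "Beijing"}}

archive, all_times = {}, set()
for t, version in [(2002, v2002), (2003, v2003)]:
    merge_version(archive, version, t)
    all_times.add(t)
push_time_down(archive, all_times)
```

After compaction only the two population values carry explicit timestamps; the unchanged `Capital` value and all element nodes inherit theirs, which is where the space saving over snapshots comes from.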

14 An initial experiment. Grabbed the last 20 available versions of Swissprot. XML-ized all of them. Also recorded all OMIM versions for about 14 weeks (100 of them). Combined into an archive XML-format file by pushing time down.

15 100 days of OMIM. [Plot: size (bytes × 10^6); legend: archive, inc diff, version, compressed inc diff = gzip(inc diff), compressed archive = XMill(archive).] Uncompressed archive size is: – 1.01 times diff repository size – 1.04 times size of largest version. Compressed archive size is between 0.94 and 1 times compressed diff repository size. gzip: Unix compression tool. XMill: XML compression tool.

16 ~ 5 years of UniProt. [Plot: size (bytes × 10^6); legend: archive, inc diff, version, compressed inc diff = gzip(inc diff), compressed archive = XMill(archive).] Uncompressed archive size is: – 1.08 times diff repository size – 1.92 times size of largest version. Compressed archive size is between 0.59 and 1 times compressed diff repository size.

17 Snapshots are immediate. Longitudinal/temporal queries are also easy. [Figure: archive tree with node labels Factbook, Demography, Andorra, Liechtenstein, China, Economy, Population, and timestamps [1990], [1991], … [2006].] Plot, by year, the population of Liechtenstein since ,24728,29228,476
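Both kinds of query can be sketched over a small timestamped tree. The layout below is an assumption made for illustration (a node has `times`, `children` and `values`, and `None` means "inherit the parent's timestamps"); `snapshot`, `history` and `node` are hypothetical names, not the talk's query language. The population values are the China figures quoted later in the talk.

```python
def node(times=None, children=None, values=None):
    """Build one archive node; times=None means 'same as the parent'."""
    return {"times": times, "children": children or {}, "values": values or {}}


def snapshot(nodes, t, inherited):
    """Return the version that was current at time t as a plain nested dict."""
    out = {}
    for key, n in nodes.items():
        times = n["times"] if n["times"] is not None else inherited
        if t not in times:
            continue
        if n["children"]:
            out[key] = snapshot(n["children"], t, times)
        else:
            for value, stamps in n["values"].items():
                if t in (stamps if stamps is not None else times):
                    out[key] = value
    return out


def history(nodes, path, inherited):
    """Return {time: value} for the leaf reached by following path."""
    times, n = inherited, None
    for key in path:
        n = nodes[key]
        times = n["times"] if n["times"] is not None else times
        nodes = n["children"]
    result = {}
    for value, stamps in n["values"].items():
        for t in (stamps if stamps is not None else times):
            result[t] = value
    return dict(sorted(result.items()))


ALL = {2002, 2003, 2004}
archive = {"China": node(children={
    "Population": node(values={
        "1,284,303,705": {2002},
        "1,286,975,468": {2003},
        "1,298,847,624": {2004},
    }),
    "Capital": node(values={"Beijing": None}),
})}

snap_2003 = snapshot(archive, 2003, ALL)
population = history(archive, ["China", "Population"], ALL)
```

A snapshot is a single top-down pass that filters on one timestamp, and a longitudinal query is a single path lookup, which is why both are cheap on this representation.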

18 A Working System. Implemented by Heiko Müller. For scale, we require external sorting of large XML files; designed and implemented by Ioannis Koltsidas, Heiko Müller and Stratis Viglas. Has a simple temporal query language. Experimented with recent (HTML) versions of the CIA World Factbook.

19 What the archive looks like: Afghanistan; Communications; Internet users: 1,000 (2002); 30,000 (2005); NA. Radios: 167,000 (1999). Telephones - main lines in use: 100,000 (2005); 280,000 (2005); 29,000 (1998); 33,100 (2002); …

20 How did the population of China change from ? Population: 1,284,303,705 (July 2002 est.); 1,286,975,468 (July 2003 est.); 1,298,847,624 (July 2004 est.); 1,306,313,812 (July 2005 est.); 1,313,973,713 (July 2006 est.); 1,321,851,888 (July 2007 est.)

21 How did land area of countries change in ? … land: 82,444 sq km; 82,738 sq km … land: 545,630 sq km; 640,053 sq km; 545,630 sq km (metropolitan France) …

22 What are the differences between the factbooks on 21/08/2007 and 10/09/2007? 30,000 (2005) / 535,000 (2006); 1.4 million (2005) / 2.52 million (2006); …

23 Heiko Müller’s Xarch. Examples of use with: – Ontologies – XML files – Relational databases. Automatically converts RDBs into XML. Efficiently extracts snapshots. Simple temporal query language.

24 Provenance – a huge issue. Where did this data come from? How did it get here? How was it constructed? ... Two schools of research: workflow (coarse-grain) provenance, a complete record of how some large scientific analysis/simulation was performed; and data (fine-grain) provenance, a record of how some small piece of data (in a larger database) was produced.

25 Data provenance: an example. Copy-paste, or C V. (2001) 107,879 (as of 2001) *** 109, , ,200 (2003 est) 97,102 in ,880

26 “Where provenance”. Possible explanations of how something was copied: this data item was extracted from location L1 in document D1 and placed in location L2 in document D2, or this data item was extracted from database D1 by query Q1 and placed in database D2 by update U2 (or some combination of the two).

27 Where Provenance. The UniProt entry 11SB_CUCMA from slide 10 is shown again, with these lines picked out:
DE 11S GLOBULIN BETA SUBUNIT PRECURSOR.
OS CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC VIOLALES; CUCURBITACEAE.
CC -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
CC -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
CC BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
CC DISULFIDE BOND.
CC -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
FT CHAIN S GLOBULIN BETA SUBUNIT.
FT CHAIN GAMMA CHAIN (ACIDIC).
FT CHAIN DELTA CHAIN (BASIC).
FT MOD_RES PYRROLIDONE CARBOXYLIC ACID.
FT DISULFID INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
Where does this information come from? Which curator? Or was it the cited papers? Was it copied from some other DB?

28 Copy-paste model of curated DBs. (a) A biologist copies some UniProt records into her DB. (b) She fixes entries so that UniProt PTMs are not confused with hers. (c) She copies in some publication details from OMIM. (d) She corrects a mistake in a PubMed publication number. Curated databases are not views!! [B., Chapman, Cheney, SIGMOD ’06]

29 A very simple copy-paste language (uses a “deterministic” tree model)
(1) delete c5 from T;
(2) copy S1/a1/y into T/c1/y;
(3) insert {c2 : {}} into T;
(4) copy S1/a2 into T/c2;
(5) insert {y : {}} into T/c2;
(6) copy S2/b3/y into T/c2/y;
(7) copy S1/a3 into T/c3;
(8) insert {c4 : {}} into T;
(9) copy S2/b2 into T/c4;
(10) insert {y : 12} into T/c4;
How costly is it to record all this?
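To get a feel for what "recording all this" involves, here is a minimal sketch of an interpreter for the three operations that logs one provenance record per node touched. Everything in it is an assumption made for illustration: the `Store` class, the record layout `(operation number, kind, target path, source path)`, and the contents of `S1` and `T`, which the slide does not give. Only four operations in the style of the slide are run.

```python
from copy import deepcopy


def get(tree, path):
    """Follow a tuple of keys from the root; a deterministic tree has one path per item."""
    for key in path:
        tree = tree[key]
    return tree


def paths_under(value, prefix=()):
    """Yield the relative path of a value and of everything nested below it."""
    yield prefix
    if isinstance(value, dict):
        for key, child in value.items():
            yield from paths_under(child, prefix + (key,))


def p(text):
    """Turn 'T/c1/y' into ('T', 'c1', 'y')."""
    return tuple(text.split("/"))


class Store:
    def __init__(self, roots):
        self.roots = roots  # all databases, keyed by name
        self.prov = []      # (operation number, kind, target path, source path)
        self.op = 0

    def insert(self, key, value, into):
        self.op += 1
        get(self.roots, into)[key] = value
        self.prov.append((self.op, "I", "/".join(into + (key,)), None))

    def copy(self, src, dst):
        self.op += 1
        value = deepcopy(get(self.roots, src))
        get(self.roots, dst[:-1])[dst[-1]] = value
        for rel in paths_under(value):  # one link per copied node
            self.prov.append((self.op, "C", "/".join(dst + rel), "/".join(src + rel)))

    def delete(self, key, frm):
        self.op += 1
        removed = get(self.roots, frm).pop(key)
        for rel in paths_under(removed):
            self.prov.append((self.op, "D", "/".join(frm + (key,) + rel), None))


# Hypothetical contents for the source S1 and the target T.
s = Store({
    "S1": {"a1": {"x": 1, "y": 2}, "a2": {"x": 3, "y": 4}},
    "T": {"c1": {"x": 1, "y": 0}, "c5": {"x": 9}},
})
s.delete("c5", p("T"))
s.copy(p("S1/a1/y"), p("T/c1/y"))
s.insert("c2", {}, p("T"))
s.copy(p("S1/a2"), p("T/c2"))
```

Even this tiny run produces seven records for four operations, because copying or deleting a subtree touches every node below it.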

30 How to reduce space. Complete provenance: record every update. Transactional provenance: record the links at the end of some user-defined transaction (sequence of updates). Hierarchical (inferred) provenance: only record a link if it cannot be inferred from the provenance of a higher node. Taken together these provide a substantial saving on storage. Overhead comparable with the size of the DB in some realistic simulations.
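The hierarchical rule can be sketched as a filter over a complete table of links. This is an illustrative reading, not the scheme from the cited work: `full` maps every target path to the source path it was copied from, a link to `None` would mean "created locally", and the names `infer` and `compress` are hypothetical. The three links correspond to copying `S1/a2` into `T/c2` and then overwriting `T/c2/y` from `S2/b3/y`; the child `x` is an assumed part of `S1/a2`.

```python
def infer(links, path):
    """Find where path came from, using the nearest recorded ancestor if needed."""
    parts = path.split("/")
    for i in range(len(parts), 0, -1):
        prefix = "/".join(parts[:i])
        if prefix in links:
            source = links[prefix]
            if source is None:  # ancestor was created locally
                return None
            return "/".join([source] + parts[i:])
    return None                 # nothing recorded: created locally


def compress(full):
    """Keep only the links that cannot be inferred from a higher node."""
    kept = {}
    for path in sorted(full, key=lambda q: q.count("/")):  # parents first
        if infer(kept, path) != full[path]:
            kept[path] = full[path]
    return kept


full = {
    "T/c2": "S1/a2",
    "T/c2/x": "S1/a2/x",    # follows from the link on T/c2
    "T/c2/y": "S2/b3/y",    # overrides what T/c2 would imply
}
kept = compress(full)
```

Here the link for `T/c2/x` is dropped because it follows from its parent, while `T/c2/y` is kept because it contradicts the inherited source. Looking up any path with `infer(kept, path)` still returns the same answer as the complete table.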

31 Query languages and where provenance [B., Cheney, Vansummeren, TODS 33,4, 2008]
(select A, 5 as B from R where A = 1) union (select * from R where A <> 1)
delete from R where A = 1; insert into R values (1,5)
update R set B = 5 where A =
[Figure: tables with columns A and B.]
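A small experiment shows why these three forms are interesting for provenance: they can leave the same table behind while handling the data differently. The sketch below uses Python's built-in `sqlite3` module. The initial rows of `R` are hypothetical, the truncated `where A =` in the update is assumed to be `A = 1`, and the parentheses around the two halves of the union are dropped because SQLite does not accept them.

```python
import sqlite3


def fresh():
    """Create an in-memory table R(A, B) with two hypothetical rows."""
    con = sqlite3.connect(":memory:")
    con.execute("create table R (A integer, B integer)")
    con.executemany("insert into R values (?, ?)", [(1, 2), (3, 4)])
    return con


# 1. A query that builds the new table.
con = fresh()
via_query = sorted(con.execute(
    "select A, 5 as B from R where A = 1 "
    "union select A, B from R where A <> 1").fetchall())
con.close()

# 2. Delete the row and insert a replacement.
con = fresh()
con.execute("delete from R where A = 1")
con.execute("insert into R values (1, 5)")
via_delete_insert = sorted(con.execute("select A, B from R").fetchall())
con.close()

# 3. Update the row in place (A = 1 is an assumption; the source is truncated).
con = fresh()
con.execute("update R set B = 5 where A = 1")
via_update = sorted(con.execute("select A, B from R").fetchall())
con.close()
```

All three leave the rows (1, 5) and (3, 4). On one natural reading of where-provenance they still differ: the in-place update leaves the value A = 1 where it was, the delete-and-insert brings both fields in from the statement itself, and the query copies A from `R` but takes B from a constant.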

32 Other forms of provenance in query languages. Why-provenance: why is a tuple in the output, or what parts of the input “contributed to” the tuple? [Widom et al] How-provenance: how (by what process) was this tuple constructed? [Tannen et al] [Diagram labels: Database; Large, heterogeneous source; Small part of source; Complex program or process; Simpler program/process; “Piece” of data: data value, tuple, etc.] Taken together, these are the “explanation”.

33 Workflow provenance. Taken from [Cohen et al., DILS 2006]. Each step S1 ... S4 is itself a workflow. How does one record an “enactment” of the workflow? How much “context” does one record? – from people – from databases that change. Recent attempts to produce a general model: – Open Provenance Model [Moreau et al. 2007] – Petri Net + Complex Object [Hidders et al., Inf. Syst. 2008]

34 Provenance is a very general issue. Intrinsic to data quality. It is starting to be used in several areas of CS: – Semantics of update languages – Probabilistic databases – Data integration – Debugging schema transformations – File/data synchronization – Program debugging (program slicing) – Security. The fundamental problem is finding the right model/models: can we combine data and workflow models?

35 Annotation – closely related to provenance. Much of the activity of curators is the annotation of existing data. When we copy that data, we should also copy its annotations. The propagation of annotation follows (where-) provenance. But the story is more complicated because we often annotate views.

36 The Distributed Annotation Server (DAS)

38 Annotating Databases
Source table:
Guinness | Stout | 5.0 | Eire
Heineken | Pilsner | 5.0 | Netherlands
Old Jock | Ale | 6.7 | Scotland
Guinness | Stout | 7.5 | Nigeria
Fischer | Blonde | 6.0 | France
π view:
Guinness | Stout
Heineken | Pilsner
Old Jock | Ale
Fischer | Blonde
σ view (“Not strong”):
Guinness | Stout | 5.0 | Eire
Heineken | Pilsner | 5.0 | Netherlands
Fischer | Blonde | 6.0 | France
Annotation: “Stijn says this is not a beer”
Polygen [Wang & Madnick VLDB 1990], DBNotes [Bhagwat et al, VLDB 2004]. Concern is propagation of annotations from views to source and back. Again, there is an interesting theory.
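Propagation from a view to the source and back can be sketched by tagging every cell with the source locations it was copied from. This is a minimal illustration, not Polygen or DBNotes: the helper names are hypothetical, the selection predicate "strength at most 6.0" is an assumption that happens to yield the three rows of the “Not strong” view, and the choice of which cell carries the annotation is made up, since the slide does not say.

```python
beers = [
    ("Guinness", "Stout", 5.0, "Eire"),
    ("Heineken", "Pilsner", 5.0, "Netherlands"),
    ("Old Jock", "Ale", 6.7, "Scotland"),
    ("Guinness", "Stout", 7.5, "Nigeria"),
    ("Fischer", "Blonde", 6.0, "France"),
]


def source(table):
    """Pair every cell with the set of (row, column) locations it came from."""
    return [[(v, frozenset({(i, j)})) for j, v in enumerate(row)]
            for i, row in enumerate(table)]


def select(rows, pred):
    """Selection keeps whole rows, so locations are carried along unchanged."""
    return [row for row in rows if pred([v for v, _ in row])]


def project(rows, cols):
    """Projection merges duplicate rows and unions their source locations."""
    merged = {}
    for row in rows:
        key = tuple(row[c][0] for c in cols)
        wheres = [row[c][1] for c in cols]
        if key in merged:
            wheres = [a | b for a, b in zip(merged[key], wheres)]
        merged[key] = wheres
    return [list(zip(key, wheres)) for key, wheres in merged.items()]


notes = {}  # source location -> set of annotation texts


def annotate(cell, text):
    """Annotating a view cell pushes the note back to its source locations."""
    for loc in cell[1]:
        notes.setdefault(loc, set()).add(text)


def notes_on(cell):
    """A cell in any view shows the notes attached to its source locations."""
    return set().union(*(notes.get(loc, set()) for loc in cell[1]))


base = source(beers)
names = project(base, [0, 1])                       # 4 distinct rows
not_strong = select(base, lambda r: r[2] <= 6.0)    # 3 rows

# Hypothetical choice of target: the name cell of the Fischer row in the projection.
annotate(names[3][0], "Stijn says this is not a beer")
```

The note made on the projection now shows up on the Fischer row of the selection, because both view cells trace back to the same source location. The merged Guinness row shows why views complicate matters: a note on it would land on two source rows at once.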

39 How do you cite something in a database? Many scientific databases ask you to cite them, but: they don’t tell you how, or they tell you to give the URL, or they tell you to cite a paper about the database.
Nutrition Education for Diverse Audiences [Internet]. Urbana (IL): University of Illinois Cooperative Extension Service, Illinet Department; [updated 2000 Nov 28; cited 2001 Apr 25]. Diabetes mellitus lesson; [about 1 screen]. Available from
NLM Recommended Formats for Bibliographic Citation. Internet Supplement. NLM Technical report Bethesda, MD 20894, July

40 What is a citation? Bard JB and Davies JA. Development, Databases and the Internet. Bioessays Nov; 17(11): [Location and descriptive information]; Ann. Phys., Lpz; Nature, 171, (We often want more than location)

41 Automatically generating citations
A rule:
{ DB=IUPHAR, Version=$v, Family=$f, Receptor=$r, Contributors=$a, Editor=$e, Date=$d, DOI=$i} ← /Root[ ]/Version[Number=$’v, Editor=$?e, DOI=$.i, Date=$.d] /Data[ ] /Family[FamilyName=$’f] /Contributor-list/Contributor=$+a] /Receptor[ReceptorName=$’r]
What gets generated (example):
{ DB=IUPHAR, Version=11, Family=Calcitonin, Receptor=CALCR, Contributors={Debbie Hay, David R. Poyner}, Editor=Tony Harmar, Date=Jan 2006, DOI= }
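The effect of such a rule can be sketched as a walk over a hierarchical copy of the database that collects the fields on the way down to each receptor. This is only an illustration: the nesting of the dictionary is a guess based on the path in the rule, the function name `citations` is hypothetical, the decorations on the rule's variables are not interpreted, and the DOI is left as `None` because the slide's example shows it empty.

```python
def citations(root):
    """Yield one citation record per receptor, gathering fields from its ancestors."""
    for version in root["Version"]:
        for family in version["Data"]["Family"]:
            for receptor in family["Receptor"]:
                yield {
                    "DB": "IUPHAR",
                    "Version": version["Number"],
                    "Family": family["FamilyName"],
                    "Receptor": receptor["ReceptorName"],
                    "Contributors": list(family["Contributor-list"]),
                    "Editor": version["Editor"],
                    "Date": version["Date"],
                    "DOI": version["DOI"],
                }


# Values taken from the generated example on the slide; the layout is assumed.
root = {"Version": [{
    "Number": 11,
    "Editor": "Tony Harmar",
    "Date": "Jan 2006",
    "DOI": None,
    "Data": {"Family": [{
        "FamilyName": "Calcitonin",
        "Contributor-list": ["Debbie Hay", "David R. Poyner"],
        "Receptor": [{"ReceptorName": "CALCR"}],
    }]},
}]}

generated = list(citations(root))
```

Because every field is read off the path from the root to the cited item, the same walk gives a citation for any receptor in any version without the curators writing citations by hand.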

42 Other topics: Data quality and data cleaning. Published data often looks clean but is intrinsically messy: – “Dead” fields in the underlying data – Multiple syntactic conventions – Abuse of / confusion over formats & schema. Human errors require human correction: – Automate error detection rather than error correction. Cleaning is an essential prerequisite in any integration or preservation task.
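Detection without correction can be sketched as a report that flags suspicious fields and leaves the fixing to a curator. The example below is a hypothetical illustration: the records, the field names, and the three date patterns are all made up, and the two checks cover only “dead” fields and mixed syntactic conventions.

```python
import re
from collections import Counter

# Assumed date conventions; a real database would need its own list.
PATTERNS = {
    "dd/mm/yyyy": re.compile(r"^\d{2}/\d{2}/\d{4}$"),
    "yyyy-mm-dd": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "Mon yyyy": re.compile(r"^[A-Z][a-z]{2} \d{4}$"),
}


def dead_fields(records):
    """Return the fields that are empty or missing in every record."""
    fields = sorted({name for record in records for name in record})
    return [name for name in fields
            if all(record.get(name) in (None, "") for record in records)]


def conventions(records, field):
    """Count how many values of a field follow each known convention."""
    counts = Counter()
    for record in records:
        value = record.get(field)
        if value in (None, ""):
            continue
        label = next((name for name, pattern in PATTERNS.items()
                      if pattern.match(value)), "other")
        counts[label] += 1
    return counts


records = [
    {"name": "r1", "updated": "21/08/2007", "legacy_code": ""},
    {"name": "r2", "updated": "2007-09-10", "legacy_code": ""},
    {"name": "r3", "updated": "Jan 2006", "legacy_code": None},
]

dead = dead_fields(records)
mixed = conventions(records, "updated")
```

The report says that `legacy_code` is never filled in and that `updated` uses three different conventions. Deciding which convention is right, or whether the dead field can be dropped, stays with a person.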

43 Other topics: Evolution of Structure. Curated DBs evolve from humble origins. Schemas are often wrong; they are: – designed by people who don’t understand schemas – designed before the domain is fully understood. Do ontologies help (you can build an ontology without worrying much about the schema) or do they defer the problem and make it worse?

44 The larger (economic and social) issues. Who will archive/curate curated databases? Should they be open-access? – who pays for their maintenance? What are the legal/IP issues?

45 A case study: IUPHAR database (curated by Tony Harmar and team). “Standard” curated database. Labour-intensive (hundreds of contributors). Valuable (supported by drug companies). Simple, clean structure, as seen by users. 50m IUPHAR DCC

50 We wanted to use our archiver. Our first task was to convert the database into a hierarchical structure (following the web presentation) so that we could archive it. We used the Prata XML (Fan et al) publishing software. This had some unexpected benefits…

51 … Y Agonist empty NO YES NO Human oxytocin Full Agonist pKd 9 YES NO NO Human [3H]-oxytocin Full Agonist pKd 9, 42 …

52 We can preserve all versions of the data (as intended). We can generate static web pages (less software, more efficient). We can make the database citable. Tony can trace the history of entries. Tony can generate an old-fashioned book (yes, he wants to do this!). We have a “community model” for data exchange. The data got cleaned up in the process. The representation information (required by archivists) is greatly simplified.

53 Selected pages from the book – generated by a 100-line style sheet

54 Our library will “host” the book, but not the database!

55 Centralized vs. distributed publishing. 20th-century libraries provided robust, distributed dissemination and preservation of reference material. Valuable information was lost in earlier “data centers”. Is this still happening? Replication and distribution have always been the best guarantee of preservation. We should do the same for curated databases: a database LOCKSS?

56 Many of the issues are “non-technical”. A good economic model for sustainability: – Open access works for journal papers – Can it work for curated DBs? They require long-term support. And people who write reference manuals sometimes expect to make money out of them. Intellectual property in curated databases is a nightmare: legislation is still largely based on the notion of copying. We can still help by providing good models of the processes in curating and publishing databases.

